Computation and Language [92]
☆ Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
In this work, we introduce Mini-Gemini, a simple and effective framework
enhancing multi-modality Vision Language Models (VLMs). Despite the
advancements in VLMs facilitating basic visual dialog and reasoning, a
performance gap persists compared to advanced models like GPT-4 and Gemini. We
try to narrow the gap by mining the potential of VLMs for better performance
and any-to-any workflow from three aspects, i.e., high-resolution visual
tokens, high-quality data, and VLM-guided generation. To enhance visual tokens,
we propose to utilize an additional visual encoder for high-resolution
refinement without increasing the visual token count. We further construct a
high-quality dataset that promotes precise image comprehension and
reasoning-based generation, expanding the operational scope of current VLMs. In
general, Mini-Gemini further mines the potential of VLMs and empowers current
frameworks with image understanding, reasoning, and generation simultaneously.
Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs)
from 2B to 34B. It is demonstrated to achieve leading performance in several
zero-shot benchmarks and even surpasses the developed private models. Code and
models are available at https://github.com/dvlab-research/MiniGemini.
☆ Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation LREC
The rise of Modular Deep Learning showcases its potential in various Natural
Language Processing applications. Parameter-efficient fine-tuning (PEFT)
modularity has been shown to work for various use cases, from domain adaptation
to multilingual setups. However, all this work covers the case where the
modular components are trained and deployed within one single Pre-trained
Language Model (PLM). This model-specific setup is a substantial limitation on
the very modularity that modular architectures are trying to achieve. We ask
whether current modular approaches are transferable between models and whether
we can transfer the modules from more robust and larger PLMs to smaller ones.
In this work, we aim to fill this gap through the lens of Knowledge Distillation,
commonly used for model compression, and present an extremely straightforward
approach to transferring pre-trained, task-specific PEFT modules between
same-family PLMs. Moreover, we propose a method that allows the transfer of
modules between incompatible PLMs without any change in the inference
complexity. The experiments on Named Entity Recognition, Natural Language
Inference, and Paraphrase Identification tasks over multiple languages and PEFT
methods showcase the initial potential of transferable modularity.
comment: Accepted at LREC-COLING 2024
☆ Projective Methods for Mitigating Gender Bias in Pre-trained Language Models
Mitigation of gender bias in NLP has a long history tied to debiasing static
word embeddings. More recently, attention has shifted to debiasing pre-trained
language models. We study to what extent the simplest projective debiasing
methods, developed for word embeddings, can help when applied to BERT's
internal representations. Projective methods are fast to implement, use a small
number of saved parameters, and make no updates to the existing model
parameters. We evaluate the efficacy of the methods both in reducing intrinsic
bias, as measured by BERT's next sentence prediction task, and in mitigating
observed bias in a downstream setting when fine-tuned. To this end, we also
provide a critical analysis of a popular gender-bias assessment test for
quantifying intrinsic bias, resulting in an enhanced test set and new bias
measures. We find that projective methods can be effective at both intrinsic
bias and downstream bias mitigation, but that the two outcomes are not
necessarily correlated. This finding serves as a warning that intrinsic bias
test sets, based either on language modeling tasks or next sentence prediction,
should not be the only benchmark in developing a debiased language model.
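A minimal sketch of the simplest kind of projective debiasing, assuming a single bias direction estimated from one word pair. The abstract does not say which projective variants were evaluated, so the vectors, the `project_out` helper, and the choice of direction are illustrative assumptions rather than the authors' method.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project_out(vec, direction):
    """Remove the component of vec that lies along direction."""
    d = normalize(direction)
    dot = sum(a * b for a, b in zip(vec, d))
    return [a - dot * b for a, b in zip(vec, d)]

# Hypothetical 3-dimensional representations (real BERT states are much larger).
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]
gender_dir = [h - s for h, s in zip(he, she)]  # assumed bias direction

rep = [0.5, 0.3, 0.8]  # hypothetical internal representation
debiased = project_out(rep, gender_dir)
```

Only the direction vector needs to be stored, and no model parameters are updated, which matches the properties the abstract attributes to projective methods.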
☆ Long-form factuality in large language models
Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
Large language models (LLMs) often generate content that contains factual
errors when responding to fact-seeking prompts on open-ended topics. To
benchmark a model's long-form factuality in open domains, we first use GPT-4 to
generate LongFact, a prompt set comprising thousands of questions spanning 38
topics. We then propose that LLM agents can be used as automated evaluators for
long-form factuality through a method which we call Search-Augmented Factuality
Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into
a set of individual facts and to evaluate the accuracy of each fact using a
multi-step reasoning process comprising sending search queries to Google Search
and determining whether a fact is supported by the search results. Furthermore,
we propose extending F1 score as an aggregated metric for long-form factuality.
To do so, we balance the percentage of supported facts in a response
(precision) with the percentage of provided facts relative to a hyperparameter
representing a user's preferred response length (recall).
Empirically, we demonstrate that LLM agents can achieve superhuman rating
performance: on a set of ~16k individual facts, SAFE agrees with crowdsourced
human annotators 72% of the time, and on a random subset of 100 disagreement
cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times
cheaper than human annotators. We also benchmark thirteen language models on
LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding
that larger language models generally achieve better long-form factuality.
LongFact, SAFE, and all experimental code are available at
https://github.com/google-deepmind/long-form-factuality.
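The extended F1 metric described above can be sketched as follows. The function name `f1_at_k`, the argument names, and the choice to cap recall at 1 once the response reaches the preferred length `k` are assumptions made for illustration; the abstract only states that precision is the share of supported facts and that recall is measured relative to a length hyperparameter.

```python
def f1_at_k(num_supported, num_not_supported, k):
    """Combine factual precision with length-relative recall.

    num_supported: facts in the response judged as supported
    num_not_supported: facts judged as not supported
    k: assumed hyperparameter for the user's preferred number of facts
    """
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)  # assumption: recall saturates at 1
    return 2 * precision * recall / (precision + recall)

# Hypothetical response: 40 supported facts, 10 unsupported, preferred length 64.
score = f1_at_k(40, 10, 64)
```

With these hypothetical counts, precision is 0.8 and recall is 0.625, so a response is penalized both for unsupported facts and for being shorter than the preferred length.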
☆ Towards a World-English Language Model for On-Device Virtual Assistants ICASSP 2024
Neural Network Language Models (NNLMs) for Virtual Assistants (VAs) are
generally language-, region-, and in some cases, device-dependent, which
increases the effort to scale and maintain them. Combining NNLMs for one or
more of the categories is one way to improve scalability. In this work, we
combine regional variants of English to build a "World English" NNLM for
on-device VAs. In particular, we investigate the application of adapter
bottlenecks to model dialect-specific characteristics in our existing
production NNLMs and enhance the multi-dialect baselines. We find that
adapter modules are more effective in modeling dialects than specializing
entire sub-networks. Based on this insight and leveraging the design of our
production models, we introduce a new architecture for World English NNLM that
meets the accuracy, latency, and memory constraints of our single-dialect
models.
comment: Accepted in ICASSP 2024
☆ CheckEval: Robust Evaluation Framework using Large Language Model via Checklist
We introduce CheckEval, a novel evaluation framework using Large Language
Models, addressing the challenges of ambiguity and inconsistency in current
evaluation methods. CheckEval addresses these challenges by dividing evaluation
criteria into detailed sub-aspects and constructing a checklist of Boolean
questions for each, simplifying the evaluation. This approach not only renders
the process more interpretable but also significantly enhances the robustness
and reliability of results by focusing on specific evaluation dimensions.
Validated through a focused case study using the SummEval benchmark, CheckEval
indicates a strong correlation with human judgments. Furthermore, it
demonstrates a highly consistent Inter-Annotator Agreement. These findings
highlight the effectiveness of CheckEval for objective, flexible, and precise
evaluations. By offering a customizable and interactive framework, CheckEval
sets a new standard for the use of LLMs in evaluation, responding to the
evolving needs of the field and establishing a clear method for future
LLM-based evaluation.
comment: HEAL at CHI 2024
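One way to picture the checklist idea is to score each sub-aspect as the fraction of Boolean questions answered "yes" and then average across sub-aspects. The abstract does not specify how answers are aggregated, so the averaging rule, the sub-aspect names, and the `checklist_score` helper below are assumptions for illustration only.

```python
def checklist_score(answers):
    """Aggregate Boolean checklist answers.

    answers: dict mapping a sub-aspect name to a list of True/False answers,
    one per checklist question (in practice these would come from an LLM).
    Returns (per-aspect scores, unweighted mean over aspects).
    """
    per_aspect = {name: sum(vals) / len(vals) for name, vals in answers.items()}
    overall = sum(per_aspect.values()) / len(per_aspect)
    return per_aspect, overall

# Hypothetical answers for one summary.
answers = {
    "coherence": [True, True, False, True],
    "fluency": [True, False],
}
per_aspect, overall = checklist_score(answers)
```

Keeping per-aspect scores alongside the overall score preserves the interpretability that the checklist decomposition is meant to provide.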
☆ Improved Neural Protoform Reconstruction via Reflex Prediction LREC
Protolanguage reconstruction is central to historical linguistics. The
comparative method, one of the most influential theoretical and methodological
frameworks in the history of the language sciences, allows linguists to infer
protoforms (reconstructed ancestral words) from their reflexes (related modern
words) based on the assumption of regular sound change. Not surprisingly,
numerous computational linguists have attempted to operationalize comparative
reconstruction through various computational models, the most successful of
which have been supervised encoder-decoder models, which treat the problem of
predicting protoforms given sets of reflexes as a sequence-to-sequence problem.
We argue that this framework ignores one of the most important aspects of the
comparative method: not only should protoforms be inferable from cognate sets
(sets of related reflexes) but the reflexes should also be inferable from the
protoforms. Leveraging another line of research -- reflex prediction -- we
propose a system in which candidate protoforms from a reconstruction model are
reranked by a reflex prediction model. We show that this more complete
implementation of the comparative method allows us to surpass state-of-the-art
protoform reconstruction methods on three of four Chinese and Romance datasets.
comment: Accepted to LREC-COLING 2024
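The reranking step can be sketched as combining each candidate's reconstruction score with a score for how well the reflexes can be predicted back from it. The additive combination, the `weight` parameter, and the toy log-probabilities below are assumptions; the abstract does not describe the actual scoring rule.

```python
def rerank(candidates, reflex_score, weight=1.0):
    """Rerank candidate protoforms.

    candidates: list of (protoform, reconstruction_log_prob) pairs
    reflex_score: callable returning a log-probability that the observed
        reflexes are predicted from the given protoform (stand-in for a
        trained reflex prediction model)
    weight: assumed interpolation weight between the two models
    """
    scored = [(p, lp + weight * reflex_score(p)) for p, lp in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical candidates and reflex-prediction scores.
candidates = [("kap", -0.5), ("kab", -0.7)]
toy_reflex_log_probs = {"kap": -3.0, "kab": -1.0}
ranked = rerank(candidates, toy_reflex_log_probs.get)
```

In this toy case the reconstruction model prefers "kap", but "kab" explains the reflexes better and moves to the top after reranking.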
☆ CYCLE: Learning to Self-Refine the Code Generation
Pre-trained code language models have achieved promising performance in code
generation and improved the programming efficiency of human developers.
However, their self-refinement capability is typically overlooked by the
existing evaluations of code LMs, which focus only on the accuracy of the
one-time prediction. For the cases when code LMs fail to implement the correct
program, developers actually find it hard to debug and fix the faulty
prediction since it is not written by the developers themselves. Unfortunately,
our study reveals that code LMs cannot efficiently self-refine their faulty
generations either.
In this paper, we propose the CYCLE framework, which learns to self-refine the faulty
generation according to the available feedback, such as the execution results
reported by the test suites. We evaluate CYCLE on three popular code generation
benchmarks, HumanEval, MBPP, and APPS. The results reveal that CYCLE
successfully maintains, sometimes improves, the quality of one-time code
generation, while significantly improving the self-refinement capability of
code LMs. We implement four variants of CYCLE with varied numbers of parameters
across 350M, 1B, 2B, and 3B, and the experiments show that CYCLE consistently
boosts the code generation performance, by up to 63.5%, across benchmarks and
varied model sizes. We also notice that CYCLE outperforms code LMs that have
3× more parameters in self-refinement.
comment: Camera-ready for OOPSLA'24
☆ Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
Large Vision-Language Models (LVLMs) are increasingly adept at generating
contextually detailed and coherent responses from visual inputs. However, their
application in multimodal decision-making and open-ended generation is hindered
by a notable rate of hallucinations, where generated text inaccurately
represents the visual contents. To address this issue, this paper introduces
the Instruction Contrastive Decoding (ICD) method, a novel approach designed to
reduce hallucinations during LVLM inference. Our method is inspired by our
observation that what we call disturbance instructions significantly exacerbate
hallucinations in multimodal fusion modules. ICD contrasts distributions from
standard and instruction disturbance, thereby increasing alignment uncertainty
and effectively subtracting hallucinated concepts from the original
distribution. Through comprehensive experiments on discriminative benchmarks
(POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that
ICD significantly mitigates both object-level and attribute-level
hallucinations. Moreover, our method not only addresses hallucinations but also
significantly enhances the general perception and recognition capabilities of
LVLMs.
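The abstract does not give the decoding formula, so the following is only a sketch in the style of generic contrastive decoding: next-token logits obtained with a disturbance instruction are subtracted from those obtained with the standard instruction. The `(1 + alpha) * standard - alpha * disturbed` form, the `alpha` value, and the toy logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrast(standard_logits, disturbed_logits, alpha=1.0):
    """Down-weight tokens that the disturbed instruction makes more likely."""
    adjusted = [(1 + alpha) * s - alpha * d
                for s, d in zip(standard_logits, disturbed_logits)]
    return softmax(adjusted)

# Toy vocabulary of three tokens; token 1 plays the hallucinated concept.
standard = [2.0, 1.9, 0.1]
disturbed = [1.0, 2.5, 0.1]
before = softmax(standard)
after = contrast(standard, disturbed)
```

Here the token that the disturbance amplifies loses probability mass after contrasting, while the token it suppresses gains mass.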
☆ The Invalsi Benchmark: measuring Language Models Mathematical and Language understanding in Italian
While Italian is by all metrics a high-resource language, there is currently
no Language Model pre-trained exclusively in this language. This
results in a lower number of available benchmarks to evaluate the performance
of language models in Italian.
This work presents two new benchmarks to evaluate models' performance on
mathematical understanding and language understanding in Italian. These
benchmarks are based on real tests taken by students aged between 11 and 18
within the Italian school system and have therefore been
validated by several experts in didactics and pedagogy.
To validate this dataset we evaluate the performance of 9 language models
that are the best performing when writing in Italian, including our own
fine-tuned models. We show that this is a challenging benchmark where current
language models are bounded by 60% accuracy.
We believe that the release of this dataset paves the way for improving
future models' mathematical and language understanding in Italian.
☆ Scaling Laws For Dense Retrieval SIGIR 2024
Scaling up neural models has yielded significant advancements in a wide array
of tasks, particularly in language generation. Previous studies have found that
the performance of neural models frequently adheres to predictable scaling
laws, correlated with factors such as training set size and model size. This
insight is invaluable, especially as large-scale experiments grow increasingly
resource-intensive. Yet such scaling laws have not been fully explored in dense
retrieval due to the discrete nature of retrieval metrics and complex
relationships between training data and model sizes in retrieval tasks. In this
study, we investigate whether the performance of dense retrieval models follows
scaling laws as other neural models do. We propose to use contrastive
log-likelihood as the evaluation metric and conduct extensive experiments with
dense retrieval models implemented with different numbers of parameters and
trained with different amounts of annotated data. Results indicate that, under
our settings, the performance of dense retrieval models follows a precise
power-law scaling related to the model size and the number of annotations.
Additionally, we examine scaling with prevalent data augmentation methods to
assess the impact of annotation quality, and apply the scaling law to find the
best resource allocation strategy under a budget constraint. We believe that
these insights will significantly contribute to understanding the scaling
effect of dense retrieval models and offer meaningful guidance for future
research endeavors.
comment: Accepted at SIGIR 2024
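A power-law relationship of the form loss = a * N^(-b) becomes a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below assumes this simple two-parameter form with no irreducible-loss term and uses synthetic data; the abstract does not state the exact functional form or report these numbers.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by linear regression on logarithms."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic model sizes and losses generated from a known power law.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
```

A smooth metric such as the contrastive log-likelihood proposed in the abstract suits this kind of fit better than discrete ranking metrics, which change in steps.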
☆ NL-ITI: Optimizing Probing and Intervention for Improvement of ITI Method
Jakub Hoscilowicz, Adam Wiacek, Jan Chojnacki, Adam Cieslak, Leszek Michon, Vitalii Urbanevych, Artur Janicki
Large Language Models (LLMs) are prone to returning false information, which
is one of the major challenges in the AI field. In our work, we explore the
paradigm introduced by Inference-Time-Intervention (ITI). In the first stage, it
identifies the attention heads that contain the highest amount of the desired
type of knowledge (e.g., truthful). Afterwards, during inference, LLM
activations are shifted for the chosen subset of attention heads. We further
improve the ITI framework by introducing nonlinear probing and multi-token
intervention: Non-Linear ITI (NL-ITI). NL-ITI is tested on diverse
multiple-choice benchmarks, including TruthfulQA, on which we report around 14%
MC1 metric improvement with respect to the baseline ITI results. NL-ITI also
achieves encouraging results on other test sets: on the Business Ethics
subdomain of MMLU, around 18% MC1 improvement over baseline LLaMA2-7B.
Additionally, NL-ITI performs better while being less invasive in the behavior
of the LLM (as measured by Kullback-Leibler divergence).
comment: Code is available at https://github.com/Samsung/NL-ITI
★ Fact Checking Beyond Training Set NAACL 2024
Evaluating the veracity of everyday claims is time-consuming and in some
cases requires domain expertise. We empirically demonstrate that the commonly
used fact checking pipeline, known as the retriever-reader, suffers from
performance deterioration when it is trained on the labeled data from one
domain and used in another domain. Afterwards, we delve into each component of
the pipeline and propose novel algorithms to address this problem. We propose
an adversarial algorithm to make the retriever component robust against
distribution shift. Our core idea is to initially train a bi-encoder on the
labeled source data, and then to adversarially train two separate document and
claim encoders using unlabeled target data. We then focus on the reader
component and propose to train it such that it is insensitive towards the order
of claims and evidence documents. Our empirical evaluations support the
hypothesis that such a reader shows a higher robustness against distribution
shift. To our knowledge, there is no publicly available multi-topic fact
checking dataset. Thus, we propose a simple automatic method to re-purpose two
well-known fact checking datasets. We then construct eight fact checking
scenarios from these datasets, and compare our model to a set of strong
baseline models, including recent domain adaptation models that use GPT4 for
generating synthetic data.
comment: NAACL 2024
☆ Improving Content Recommendation: Knowledge Graph-Based Semantic Contrastive Learning for Diversity and Cold-Start Users LREC
Yejin Kim, Scott Rome, Kevin Foley, Mayur Nankani, Rimon Melamed, Javier Morales, Abhay Yadav, Maria Peifer, Sardar Hamidian, H. Howie Huang
Addressing the challenges related to data sparsity, cold-start problems, and
diversity in recommendation systems is both crucial and demanding. Many current
solutions leverage knowledge graphs to tackle these issues by combining both
item-based and user-item collaborative signals. A common trend in these
approaches focuses on improving ranking performance at the cost of escalating
model complexity, reducing diversity, and complicating the task. It is
essential to provide recommendations that are both personalized and diverse,
rather than solely relying on achieving high rank-based performance, such as
Click-through Rate, Recall, etc. In this paper, we propose a hybrid multi-task
learning approach, training on user-item and item-item interactions. We apply
item-based contrastive learning on descriptive text, sampling positive and
negative pairs based on item metadata. Our approach allows the model to better
understand the relationships between entities within the knowledge graph by
utilizing semantic information from text. This leads to more accurate, relevant,
and diverse user recommendations, a benefit that extends even to cold-start
users who have few interactions with items. We perform extensive experiments on
two widely used datasets to validate the effectiveness of our approach. Our
findings demonstrate that jointly training user-item interactions and
item-based signals using synopsis text is highly effective. Furthermore, our
results provide evidence that item-based contrastive learning enhances the
quality of entity embeddings, as indicated by metrics such as uniformity and
alignment.
comment: Accepted at LREC-COLING 2024
☆ SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens
We propose an acceleration scheme for large language models (LLMs) through
Speculative Decoding with Semantic Adaptive Tokens (SDSAT). The primary
objective of this design is to enhance the LLM's ability to generate
draft tokens more accurately without compromising the model's accuracy. The
core strategies are: 1) fine-tuning the model by incorporating semantic
adaptive tokens that possess flexible decoding capabilities without changing
its structure, allowing them to generate high-quality draft tokens; 2)
employing a training method that does not affect the standard tokens, so that
the model can acquire parallel decoding abilities atop its original framework
with minimal training overhead; and 3) designing "two-step-draft-then-verify"
generation strategies using both greedy search and nucleus sampling.
Experiments conducted on the CodeLlama-13B and 7B models have yielded speed
increases of over 3.5X and 3.0X, respectively. Please refer to
https://github.com/hasuoshenyun/SDSAT.
comment: 12 pages, 7 figures
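The verify half of a draft-then-verify scheme can be sketched for the greedy case as below. This is the generic greedy verification used in speculative decoding, not necessarily the exact SDSAT procedure; the function name and the token ids are hypothetical, and the nucleus-sampling variant is not shown.

```python
def verify_draft(draft_tokens, verified_tokens):
    """Accept the longest draft prefix that matches the model's greedy choices.

    draft_tokens: tokens proposed cheaply in advance
    verified_tokens: the model's greedy token at each position, computed in
        one parallel pass over context + draft; it has one more entry than
        draft_tokens (the token following a fully accepted draft)
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != verified_tokens[i]:
            break
        accepted.append(tok)
    # The model's own token at the first unaccepted position is always kept,
    # so every verification pass yields at least one new token.
    accepted.append(verified_tokens[len(accepted)])
    return accepted

# Hypothetical token ids: the third draft token is rejected.
partial = verify_draft([5, 7, 9], [5, 7, 2, 4])
full = verify_draft([5, 7, 9], [5, 7, 9, 4])
```

Because the accepted tokens are exactly what greedy decoding would have produced, the output is unchanged while several tokens can be emitted per forward pass.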
☆ Vulnerability Detection with Code Language Models: How Far Are We?
Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen
In the context of the rising interest in code language models (code LMs) and
vulnerability detection, we study the effectiveness of code LMs for detecting
vulnerabilities. Our analysis reveals significant shortcomings in existing
vulnerability datasets, including poor data quality, low label accuracy, and
high duplication rates, leading to unreliable model performance in realistic
vulnerability detection scenarios. Additionally, the evaluation methods used
with these datasets are not representative of real-world vulnerability
detection.
To address these challenges, we introduce PrimeVul, a new dataset for
training and evaluating code LMs for vulnerability detection. PrimeVul
incorporates a novel set of data labeling techniques that achieve comparable
label accuracy to human-verified benchmarks while significantly expanding the
dataset. It also implements a rigorous data de-duplication and chronological
data splitting strategy to mitigate data leakage issues, alongside introducing
more realistic evaluation metrics and settings. This comprehensive approach
aims to provide a more accurate assessment of code LMs' performance in
real-world conditions.
Evaluating code LMs on PrimeVul reveals that existing benchmarks
significantly overestimate the performance of these models. For instance, a
state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on
PrimeVul. Attempts to improve performance through advanced training techniques
and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin
to random guessing in the most stringent settings. These findings underscore
the considerable gap between current capabilities and the practical
requirements for deploying code LMs in security roles, highlighting the need
for more innovative research in this domain.
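De-duplication followed by a chronological split can be sketched as below. The field names `code` and `commit_date`, the whitespace-stripping normalization, and the 80/20 ratio are assumptions for illustration; the abstract does not describe PrimeVul's actual pipeline at this level of detail.

```python
import hashlib

def normalize(code):
    """Assumed normalization: drop all whitespace before hashing."""
    return "".join(code.split())

def dedupe(samples):
    """Keep the first occurrence of each normalized function body."""
    seen, unique = set(), []
    for sample in samples:
        digest = hashlib.sha256(normalize(sample["code"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique

def chronological_split(samples, train_frac=0.8):
    """Train on older commits, test on newer ones (ISO dates sort as strings)."""
    ordered = sorted(samples, key=lambda s: s["commit_date"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

# Hypothetical samples; the second is a reformatted copy of the first.
samples = [
    {"code": "int f() { return 1; }", "commit_date": "2019-03-01"},
    {"code": "int f() {\n  return 1;\n}", "commit_date": "2021-06-15"},
    {"code": "void g(char *p) { free(p); }", "commit_date": "2020-01-10"},
    {"code": "int h(int x) { return x + 1; }", "commit_date": "2018-07-04"},
    {"code": "void k() {}", "commit_date": "2022-11-30"},
    {"code": "int m() { return 0; }", "commit_date": "2017-02-02"},
]
unique = dedupe(samples)
train, test = chronological_split(unique)
```

Splitting by time rather than at random keeps later code out of the training set, which is the leakage the abstract aims to mitigate.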
☆ A survey on learning models of spiking neural membrane systems and spiking neural networks
Spiking neural networks (SNN) are a biologically inspired model of neural
networks with certain brain-like properties. In the past few decades, this
model has received increasing attention in the computer science community, owing
also to the success of deep learning. In SNN, communication
between neurons takes place through spikes and spike trains. This
differentiates these models from the "standard" artificial neural networks
(ANN) where the frequency of spikes is replaced by real-valued signals. Spiking
neural P systems (SNPS) can be considered a branch of SNN based more on the
principles of formal automata, with many variants developed within the
framework of membrane computing theory. In this paper, we first briefly
compare the structure, function, advantages, and drawbacks of SNN and SNPS. A key
part of the article is a survey of recent results and applications of machine
learning and deep learning models of both SNN and SNPS formalisms.
☆ Debiasing Sentence Embedders through Contrastive Word Pairs
Over the last few years, various sentence embedders have been an integral part of
the success of current machine learning approaches to Natural Language
Processing (NLP). Unfortunately, multiple sources have shown that the bias,
inherent in the datasets upon which these embedding methods are trained, is
learned by them. A variety of different approaches to remove biases in
embeddings exists in the literature. Most of these approaches are applicable to
word embeddings and in fewer cases to sentence embeddings. It is problematic
that most debiasing approaches are directly transferred from word embeddings;
these approaches therefore fail to take into account the nonlinear nature of
sentence embedders and the embeddings they produce. It has been shown in the
literature that bias information is still present if sentence embeddings are
debiased using such methods. In this contribution, we explore an approach to
remove linear and nonlinear bias information for NLP solutions, without
impacting downstream performance. We compare our approach to common debiasing
methods on classical bias metrics and on bias metrics which take nonlinear
information into account.
☆ Attention-aware semantic relevance predicting Chinese sentence reading
In recent years, several influential computational models and metrics have
been proposed to predict how humans comprehend and process sentences. One
particularly promising approach is contextual semantic similarity. Inspired by
the attention algorithm in Transformer and human memory mechanisms, this study
proposes an "attention-aware" approach for computing contextual semantic
relevance. This new approach takes into account the different contributions of
contextual parts and the expectation effect, allowing it to incorporate
contextual information fully. The attention-aware approach also facilitates the
simulation and evaluation of existing reading models. The resulting
"attention-aware" metrics of semantic relevance can more accurately predict
fixation durations in Chinese reading tasks recorded in an eye-tracking corpus
than those calculated by existing approaches. The study's findings further
provide strong support for the presence of semantic preview benefits in Chinese
naturalistic reading. Furthermore, the attention-aware metrics of semantic
relevance, being memory-based, possess high interpretability from both
linguistic and cognitive standpoints, making them a valuable computational tool
for modeling eye-movements in reading and further gaining insight into the
process of language comprehension. Our approach underscores the potential of
these metrics to advance our understanding of how humans comprehend and process
language.
☆ A Path Towards Legal Autonomy: An interoperable and explainable approach to extracting, transforming, loading and computing legal information using large language models, expert systems and Bayesian networks
Axel Constant, Hannes Westermann, Bryan Wilson, Alex Kiefer, Ines Hipolito, Sylvain Pronovost, Steven Swanson, Mahault Albarracin, Maxwell J. D. Ramstead
Legal autonomy - the lawful activity of artificial intelligence agents - can
be achieved in one of two ways. It can be achieved either by imposing
constraints on AI actors such as developers, deployers and users, and on AI
resources such as data, or by imposing constraints on the range and scope of
the impact that AI agents can have on the environment. The latter approach
involves encoding extant rules concerning AI-driven devices into the software
of AI agents controlling those devices (e.g., encoding rules about limitations
on zones of operations into the agent software of an autonomous drone device).
This is a challenge since the effectiveness of such an approach requires a method
of extracting, loading, transforming and computing legal information that would
be both explainable and legally interoperable, and that would enable AI agents
to reason about the law. In this paper, we sketch a proof of principle for such
a method using large language models (LLMs), expert legal systems known as
legal decision paths, and Bayesian networks. We then show how the proposed
method could be applied to extant regulation in matters of autonomous cars,
such as the California Vehicle Code.
☆ Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP
Vision-language models, such as CLIP, have shown promising
Out-of-Distribution (OoD) generalization under various types of distribution
shifts. Recent studies attempted to investigate the leading cause of this
capability. In this work, we follow the same path, but focus on a specific type
of OoD data - images with novel compositions of attribute-object pairs - and
study whether such models can successfully classify those images into
composition classes. We carefully designed an authentic image test dataset
called ImageNet-AO, consisting of attributes for objects that are unlikely to be
encountered in the CLIP training sets. We found that CLIPs trained with large
datasets such as OpenAI CLIP, LAION-400M, and LAION-2B show orders-of-magnitude
improvement in effective compositional OoD generalization compared to both
supervised models and CLIPs trained with smaller datasets, such as CC-12M and
YFCC-15M. Our results provide evidence that the scale and diversity of training
data and language supervision play a key role in unlocking the compositional
generalization abilities of vision-language models.
comment: Oral accepted at OODCV 2023 (http://www.ood-cv.org)
☆ AcTED: Automatic Acquisition of Typical Event Duration for Semi-supervised Temporal Commonsense QA
We propose a voting-driven semi-supervised approach to automatically acquire
the typical duration of an event and use it as pseudo-labeled data. The human
evaluation demonstrates that our pseudo labels exhibit surprisingly high
accuracy and balanced coverage. In the temporal commonsense QA task,
experimental results show that using only pseudo examples of 400 events, we
achieve performance comparable to the existing BERT-based weakly supervised
approaches that require a significant amount of training examples. When
compared to the RoBERTa baselines, our best approach establishes
state-of-the-art performance with a 7% improvement in Exact Match.
☆ Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction CVPR 2024
Language models have demonstrated impressive ability in context understanding
and generative performance. Inspired by the recent success of language
foundation models, in this paper, we propose LMTraj (Language-based Multimodal
Trajectory predictor), which recasts the trajectory prediction task into a sort
of question-answering problem. Departing from traditional numerical regression
models, which treat the trajectory coordinate sequence as continuous signals,
we consider them as discrete signals like text prompts. Specifically, we first
transform an input space for the trajectory coordinate into the natural
language space. Here, the entire time-series trajectories of pedestrians are
converted into a text prompt, and scene images are described as text
information through image captioning. The transformed numerical and image data
are then wrapped into the question-answering template for use in a language
model. Next, to guide the language model in understanding and reasoning about
high-level knowledge, such as scene context and social relationships between
pedestrians, we introduce auxiliary multi-task question answering. We
then train a numerical tokenizer with the prompt data. We encourage the
tokenizer to separate the integer and decimal parts well, and leverage it to
capture correlations between the consecutive numbers in the language model.
Lastly, we train the language model using the numerical tokenizer and all of
the question-answer prompts. Here, we propose a beam-search-based most-likely
prediction and a temperature-based multimodal prediction to implement both
deterministic and stochastic inferences. Applying our LMTraj, we show that the
language-based model can be a powerful pedestrian trajectory predictor, and
outperforms existing numerical-based predictor methods. Code is publicly
available at https://github.com/inhwanbae/LMTrajectory .
comment: Accepted at CVPR 2024
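As a rough illustration of the coordinate-to-text step described in the abstract above, the sketch below renders an observed trajectory and a scene caption as a question-answering prompt. The template wording, the function name `trajectory_to_prompt`, and the two-decimal formatting are assumptions made for illustration; they are not taken from the LMTraj code.

```python
from typing import List, Tuple


def trajectory_to_prompt(
    history: List[Tuple[float, float]],
    scene_caption: str,
    horizon: int,
) -> str:
    """Render observed (x, y) positions and a scene caption as a QA prompt.

    The template is hypothetical, not the one used by LMTraj.
    """
    # Fixed-precision formatting keeps integer and decimal parts explicit.
    coords = ", ".join(f"({x:.2f}, {y:.2f})" for x, y in history)
    return (
        f"context: {scene_caption} "
        f"question: The pedestrian moved along {coords}. "
        f"What are the next {horizon} positions? "
        f"answer:"
    )


prompt = trajectory_to_prompt([(1.0, 2.5), (1.2, 2.75)], "A paved plaza.", 3)
print(prompt)
```

A language model would then be asked to continue the text after `answer:`, and the generated numbers would be parsed back into coordinates.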
☆ DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment
Recent research demonstrates the effectiveness of using pre-trained language
models for legal case retrieval. Most of the existing works focus on improving
the representation ability for the contextualized embedding of the [CLS] token
and calculate relevance using textual semantic similarity. However, in the
legal domain, textual semantic similarity does not always imply that the cases
are relevant enough. Instead, relevance in legal cases primarily depends on the
similarity of key facts that impact the final judgment. Without proper
treatments, the discriminative ability of learned representations could be
limited since legal cases are lengthy and contain numerous non-key facts. To
this end, we introduce DELTA, a discriminative model designed for legal case
retrieval. The basic idea involves pinpointing key facts in legal cases and
pulling the contextualized embedding of the [CLS] token closer to the key facts
while pushing away from the non-key facts, which can warm up the case embedding
space in an unsupervised manner. To be specific, this study brings the word
alignment mechanism to the contextual masked auto-encoder. First, we leverage
shallow decoders to create information bottlenecks, aiming to enhance the
representation ability. Second, we employ the deep decoder to enable
translation between different structures, with the goal of pinpointing key
facts to enhance discriminative ability. Comprehensive experiments conducted on
publicly available legal benchmarks show that our approach can outperform
existing state-of-the-art methods in legal case retrieval. It provides a new
perspective on the in-depth understanding and processing of legal case
documents.
comment: 11 pages
☆ Exploring language relations through syntactic distances and geographic proximity
Languages are grouped into families that share common linguistic traits.
While this approach has been successful in understanding genetic relations
between diverse languages, more analyses are needed to accurately quantify
their relatedness, especially in less studied linguistic levels such as syntax.
Here, we explore linguistic distances using series of parts of speech (POS)
extracted from the Universal Dependencies dataset. Within an
information-theoretic framework, we show that employing POS trigrams maximizes
the possibility of capturing syntactic variations while being at the same time
compatible with the amount of available data. Linguistic connections are then
established by assessing pairwise distances based on the POS distributions.
Intriguingly, our analysis reveals definite clusters that correspond to well
known language families and groups, with exceptions explained by distinct
morphological typologies. Furthermore, we obtain a significant correlation
between language similarity and geographic distance, which underscores the
influence of spatial proximity on language kinships.
comment: 36 pages
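The pairwise comparison of POS trigram distributions described above can be sketched as follows. The abstract does not name the distance measure, so the Jensen-Shannon distance used here is an assumption, and the tag sequences are invented toy data rather than Universal Dependencies samples.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Trigram = Tuple[str, str, str]


def trigram_distribution(tags: List[str]) -> Dict[Trigram, float]:
    """Relative frequency of each POS trigram in a tag sequence."""
    counts = Counter(zip(tags, tags[1:], tags[2:]))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least three tags to form a trigram")
    return {gram: n / total for gram, n in counts.items()}


def js_distance(p: Dict[Trigram, float], q: Dict[Trigram, float]) -> float:
    """Jensen-Shannon distance (base 2), bounded in [0, 1]."""
    divergence = 0.0
    for gram in set(p) | set(q):
        pi, qi = p.get(gram, 0.0), q.get(gram, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return math.sqrt(max(divergence, 0.0))


# Toy tag sequences, invented for illustration.
lang_a = ["DET", "NOUN", "VERB", "DET", "NOUN"]
lang_b = ["DET", "NOUN", "VERB", "DET", "NOUN"]
lang_c = ["NOUN", "DET", "VERB", "NOUN", "DET"]

d_same = js_distance(trigram_distribution(lang_a), trigram_distribution(lang_b))
d_diff = js_distance(trigram_distribution(lang_a), trigram_distribution(lang_c))
```

Identical sequences give a distance of zero, while sequences that share no trigram reach the upper bound of one. A matrix of such distances over many languages could then be clustered.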
☆ TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions SIGIR 2024
Nowadays, individuals tend to engage in dialogues with Large Language Models,
seeking answers to their questions. In times when such answers are readily
accessible to anyone, stimulating and preserving human cognitive abilities
and maintaining good reasoning skills become crucial. This study addresses
such needs by proposing hints
(instead of final answers or before giving answers) as a viable solution. We
introduce a framework for the automatic hint generation for factoid questions,
employing it to construct TriviaHG, a novel large-scale dataset featuring
160,230 hints corresponding to 16,645 questions from the TriviaQA dataset.
Additionally, we present an automatic evaluation method that measures the
Convergence and Familiarity quality attributes of hints. To evaluate the
TriviaHG dataset and the proposed evaluation method, we enlisted 10 individuals
to annotate 2,791 hints and tasked 6 humans with answering questions using the
provided hints. The effectiveness of hints varied, with success rates of 96%,
78%, and 36% for questions with easy, medium, and hard answers, respectively.
Moreover, the proposed automatic evaluation methods showed a robust correlation
with annotators' results. Conclusively, the findings highlight three key
insights: the facilitative role of hints in resolving unknown questions, the
dependence of hint quality on answer difficulty, and the feasibility of
employing automatic evaluation methods for hint assessment.
comment: Accepted at SIGIR 2024
☆ SemRoDe: Macro Adversarial Training to Learn Representations That are Robust to Word-Level Attacks NAACL 2024
Language models (LMs) are indispensable tools for natural language processing
tasks, but their vulnerability to adversarial attacks remains a concern. While
current research has explored adversarial training techniques, their
improvements to defend against word-level attacks have been limited. In this
work, we propose a novel approach called Semantic Robust Defence (SemRoDe), a
Macro Adversarial Training strategy to enhance the robustness of LMs. Drawing
inspiration from recent studies in the image domain, we investigate and later
confirm that in a discrete data setting such as language, adversarial samples
generated via word substitutions do indeed belong to an adversarial domain
exhibiting a high Wasserstein distance from the base domain. Our method learns
a robust representation that bridges these two domains. We hypothesize that if
samples were not projected into an adversarial domain, but instead to a domain
with minimal shift, it would improve attack robustness. We align the domains by
incorporating a new distance-based objective. With this, our model is able to
learn more generalized representations by aligning the model's high-level
output features and therefore better handling unseen adversarial samples. This
method can be generalized across word embeddings, even when they share minimal
overlap at both vocabulary and word-substitution levels. To evaluate the
effectiveness of our approach, we conduct experiments on BERT and RoBERTa
models on three datasets. The results demonstrate promising state-of-the-art
robustness.
comment: Published in NAACL 2024 (Main Track)
★ BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text
Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, Christopher D. Manning
Models such as GPT-4 and Med-PaLM 2 have demonstrated impressive performance
on a wide variety of biomedical NLP tasks. However, these models have hundreds
of billions of parameters, are computationally expensive to run, require users
to send their input data over the internet, and are trained on unknown data
sources. Can smaller, more targeted models compete? To address this question,
we build and release BioMedLM, a 2.7 billion parameter GPT-style autoregressive
model trained exclusively on PubMed abstracts and full articles. When
fine-tuned, BioMedLM can produce strong multiple-choice biomedical
question-answering results competitive with much larger models, such as
achieving a score of 57.3% on MedMCQA (dev) and 69.0% on the MMLU Medical
Genetics exam. BioMedLM can also be fine-tuned to produce useful answers to
patient questions on medical topics. This demonstrates that smaller models can
potentially serve as transparent, privacy-preserving, economical and
environmentally friendly foundations for particular NLP applications, such as
in biomedicine. The model is available on the Hugging Face Hub:
https://huggingface.co/stanford-crfm/BioMedLM.
comment: 23 pages
☆ An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Stimulated by the sophisticated reasoning capabilities of recent Large
Language Models (LLMs), a variety of strategies for bridging video modality
have been devised. A prominent strategy involves Video Language Models
(VideoLMs), which train a learnable interface with video data to connect
advanced vision encoders with LLMs. Recently, an alternative strategy has
surfaced, employing readily available foundation models, such as VideoLMs and
LLMs, across multiple stages for modality bridging. In this study, we introduce
a simple yet novel strategy where only a single Vision Language Model (VLM) is
utilized. Our starting point is the plain insight that a video comprises a
series of images, or frames, interwoven with temporal information. The essence
of video comprehension lies in adeptly managing the temporal aspects along with
the spatial details of each frame. Initially, we transform a video into a
single composite image by arranging multiple frames in a grid layout. The
resulting single image is termed an image grid. This format, while
maintaining the appearance of a solitary image, effectively retains temporal
information within the grid structure. Therefore, the image grid approach
enables direct application of a single high-performance VLM without
necessitating any video-data training. Our extensive experimental analysis
across ten zero-shot video question answering benchmarks, including five
open-ended and five multiple-choice benchmarks, reveals that the proposed Image
Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out
of ten benchmarks.
comment: Our code is available at https://github.com/imagegridworth/IG-VLM
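The frame-tiling step described in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions: frames are plain nested lists of pixel values, all frames have the same size, and the function name `make_image_grid` is hypothetical. The actual IG-VLM code may sample and arrange frames differently.

```python
from typing import List

Frame = List[List[int]]  # a frame as rows of pixel values


def make_image_grid(frames: List[Frame], columns: int) -> Frame:
    """Tile equally sized frames row-major into one composite image."""
    if not frames or len(frames) % columns != 0:
        raise ValueError("frame count must be a positive multiple of columns")
    height = len(frames[0])
    grid: Frame = []
    for start in range(0, len(frames), columns):
        row_of_frames = frames[start:start + columns]
        for y in range(height):
            # Concatenate pixel row y of every frame in this grid row.
            grid.append([px for frame in row_of_frames for px in frame[y]])
    return grid


# Four 2x2 toy frames, each filled with its own frame index.
frames = [[[i, i], [i, i]] for i in range(4)]
grid = make_image_grid(frames, columns=2)
```

Placing frames in reading order (left to right, top to bottom) is what lets the single composite image carry the temporal order of the video.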
☆ Improving Attributed Text Generation of Large Language Models via Preference Learning
Large language models have been widely adopted in natural language
processing, yet they face the challenge of generating unreliable content.
Recent works aim to reduce misinformation and hallucinations by resorting to
attribution as a means to provide evidence (i.e., citations). However, current
attribution methods usually focus on the retrieval stage and automatic
evaluation that neglect mirroring the citation mechanisms in human scholarly
writing to bolster credibility. In this paper, we address these challenges by
modelling the attribution task as preference learning and introducing an
Automatic Preference Optimization (APO) framework. First, we create a curated
collection for post-training with 6,330 examples by collecting and filtering
from existing datasets. Second, considering the high cost of labelling
preference data, we further propose an automatic method to synthesize
attribution preference data resulting in 95,263 pairs. Moreover, inspired by
the human citation process, we further propose a progressive preference
optimization method by leveraging fine-grained information. Extensive
experiments on three datasets (i.e., ASQA, StrategyQA, and ELI5) demonstrate
that APO achieves state-of-the-art citation F1 with higher answer quality.
comment: 23 pages, 15 tables, 2 figures
☆ BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models
Large Language Models (LLMs) like ChatGPT and GPT-4 are versatile and capable
of addressing a diverse range of tasks. However, general LLMs, which are
developed on open-domain data, may lack the domain-specific knowledge essential
for tasks in vertical domains, such as legal, medical, etc. To address this
issue, previous approaches either conduct continuous pre-training with
domain-specific data or employ retrieval augmentation to support general LLMs.
Unfortunately, these strategies are either cost-intensive or unreliable in
practical applications. To this end, we present a novel framework named BLADE,
which enhances Black-box LArge language models with small Domain-spEcific
models. BLADE consists of a black-box LLM and a small domain-specific LM. The
small LM preserves domain-specific knowledge and offers specialized insights,
while the general LLM contributes robust language comprehension and reasoning
capabilities. Specifically, our method involves three steps: 1) pre-training
the small LM with domain-specific data, 2) fine-tuning this model using
knowledge instruction data, and 3) joint Bayesian optimization of the general
LLM and the small LM. Extensive experiments conducted on public legal and
medical benchmarks reveal that BLADE significantly outperforms existing
approaches. This shows the potential of BLADE as an effective and
cost-efficient solution in adapting general LLMs for vertical domains.
comment: 11 pages
☆ Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language
The latest advancements in machine learning and deep learning have brought
forth the concept of semantic similarity, which has proven immensely beneficial
in multiple applications and has largely replaced keyword search. However,
evaluating semantic similarity and conducting searches for a specific query
across various documents continue to be a complicated task. This complexity is
due to the multifaceted nature of the task and the lack of standard benchmarks,
and these challenges are further amplified for the Arabic language. This paper
endeavors to establish a straightforward yet potent benchmark for semantic
search in Arabic. Moreover, to precisely evaluate the effectiveness of these
metrics and the dataset, we conduct our assessment of semantic search within
the framework of retrieval augmented generation (RAG).
☆ Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback
Large Language Models (LLMs) often generate erroneous outputs, known as
hallucinations, due to their limitations in discerning questions beyond their
knowledge scope. While addressing hallucination has been a focal point in
research, previous efforts primarily concentrate on enhancing correctness
without giving due consideration to the significance of rejection mechanisms.
In this paper, we conduct a comprehensive examination of the role of rejection,
introducing the notion of model reliability along with corresponding metrics.
These metrics measure the model's ability to provide accurate responses while
adeptly rejecting questions exceeding its knowledge boundaries, thereby
minimizing hallucinations. To improve the inherent reliability of LLMs, we
present a novel alignment framework called Reinforcement Learning from
Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically
determine the model's knowledge boundary and trains a reliable reward model to
encourage the refusal of out-of-knowledge questions. Experimental results on
mathematical questions affirm the substantial efficacy of RLKF in significantly
enhancing LLM reliability.
☆ Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective
Recent advancements in Large Language Models (LLMs) have facilitated the
development of Multimodal LLMs (MLLMs). Despite their impressive capabilities,
MLLMs often suffer from an over-reliance on unimodal biases (e.g., language
bias and vision bias), leading to incorrect answers in complex multimodal
tasks. To investigate this issue, we propose a causal framework to interpret
the biases in Visual Question Answering (VQA) problems. Within our framework,
we devise a causal graph to elucidate the predictions of MLLMs on VQA problems,
and assess the causal effect of biases through an in-depth causal analysis.
Motivated by the causal graph, we introduce a novel MORE dataset, consisting of
12,000 VQA instances. This dataset is designed to challenge MLLMs' abilities,
necessitating multi-hop reasoning and the surmounting of unimodal biases.
Furthermore, we propose two strategies to mitigate unimodal biases and enhance
MLLMs' reasoning capabilities, including a Decompose-Verify-Answer (DeVA)
framework for limited-access MLLMs and the refinement of open-source MLLMs
through fine-tuning. Extensive quantitative and qualitative experiments offer
valuable insights for future research.
☆ IterAlign: Iterative Constitutional Alignment of Large Language Models NAACL 2024
With the rapid development of large language models (LLMs), aligning LLMs
with human values and societal norms to ensure their reliability and safety has
become crucial. Reinforcement learning with human feedback (RLHF) and
Constitutional AI (CAI) have been proposed for LLM alignment. However, these
methods require either heavy human annotations or explicitly pre-defined
constitutions, which are labor-intensive and resource-consuming. To overcome
these drawbacks, we study constitution-based LLM alignment and propose a
data-driven constitution discovery and self-alignment framework called
IterAlign. IterAlign leverages red teaming to unveil the weaknesses of an LLM
and automatically discovers new constitutions using a stronger LLM. These
constitutions are then used to guide self-correction of the base LLM. Such a
constitution discovery pipeline can be run iteratively and automatically to
discover new constitutions that specifically target the alignment gaps in the
current LLM. Empirical results on several safety benchmark datasets and
multiple base LLMs show that IterAlign successfully improves truthfulness,
helpfulness, harmlessness and honesty, improving the LLM alignment by up to
13.5% in harmlessness.
comment: NAACL 2024
☆ A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages LREC
Lisa Raithel, Hui-Syuan Yeh, Shuntaro Yada, Cyril Grouin, Thomas Lavergne, Aurélie Névéol, Patrick Paroubek, Philippe Thomas, Tomohiro Nishiyama, Sebastian Möller, Eiji Aramaki, Yuji Matsumoto, Roland Roller, Pierre Zweigenbaum
User-generated data sources have gained significance in uncovering Adverse
Drug Reactions (ADRs), with an increasing number of discussions occurring in
the digital world. However, the existing clinical corpora predominantly revolve
around scientific articles in English. This work presents a multilingual corpus
of texts concerning ADRs gathered from diverse sources, including patient fora,
social media, and clinical reports in German, French, and Japanese. Our corpus
contains annotations covering 12 entity types, four attribute types, and 13
relation types. It contributes to the development of real-world multilingual
language models for healthcare. We provide statistics to highlight certain
challenges associated with the corpus and conduct preliminary experiments
resulting in strong baselines for extracting entities and relations between
these entities, both within and across languages.
comment: Accepted at LREC-COLING 2024
☆ Can LLMs Converse Formally? Automatically Assessing LLMs in Translating and Interpreting Formal Specifications
Stakeholders often describe system requirements using natural language, which
is then converted to formal syntax by a domain expert, leading to increased
design costs. This paper assesses the capabilities of Large Language Models
(LLMs) in converting between natural language descriptions and formal
specifications. Existing work has evaluated the capabilities of LLMs in
generating formal syntax such as source code but such experiments are typically
hand-crafted and use problems that are likely to be in the training set of
LLMs, and often require human-annotated datasets. We propose an approach that
can use two copies of an LLM in conjunction with an off-the-shelf verifier to
automatically evaluate its translation abilities without any additional human
input. Our approach generates formal syntax using language grammars to
automatically generate a dataset. We conduct an empirical evaluation to measure
the accuracy of this translation task and show that SOTA LLMs cannot adequately
solve this task, limiting their current utility in the design of complex
systems.
☆ Chinese Offensive Language Detection: Current Status and Future Directions
Despite the considerable efforts being made to monitor and regulate
user-generated content on social media platforms, the pervasiveness of
offensive language, such as hate speech or cyberbullying, in the digital space
remains a significant challenge. Given the importance of maintaining a
civilized and respectful online environment, there is an urgent and growing
need for automatic systems capable of detecting offensive speech in real time.
However, developing effective systems for processing languages such as Chinese
presents a significant challenge, owing to the language's complex and nuanced
nature, which makes it difficult to process automatically. This paper provides
a comprehensive overview of offensive language detection in Chinese, examining
current benchmarks and approaches and highlighting specific models and tools
for addressing the unique challenges of detecting offensive language in this
complex language. The primary objective of this survey is to explore the
existing techniques and identify potential avenues for further research that
can address the cultural and linguistic complexities of Chinese.
☆ Dual Instruction Tuning with Large Language Models for Mathematical Reasoning
Recent advancements highlight the success of instruction tuning with large
language models (LLMs) utilizing Chain-of-Thought (CoT) data for mathematical
reasoning tasks. Despite the fine-tuned LLMs, challenges persist, such as
incorrect, missing, and redundant steps in CoT generation leading to
inaccuracies in answer predictions. To alleviate this problem, we propose a
dual instruction tuning strategy to meticulously model mathematical reasoning
from both forward and reverse directions. This involves introducing the
Intermediate Reasoning State Prediction task (forward reasoning) and the
Instruction Reconstruction task (reverse reasoning) to enhance the LLMs'
understanding and execution of instructions. Training instances for these tasks
are constructed based on existing mathematical instruction tuning datasets.
Subsequently, LLMs undergo multi-task fine-tuning using both existing
mathematical instructions and the newly created data. Comprehensive experiments
validate the effectiveness and domain generalization of the dual instruction
tuning strategy across various mathematical reasoning tasks.
☆ Few-Shot Recalibration of Language Models
Recent work has uncovered promising ways to extract well-calibrated
confidence estimates from language models (LMs), where the model's confidence
score reflects how likely it is to be correct. However, while LMs may appear
well-calibrated over broad distributions, this often hides significant
miscalibration within narrower slices (e.g., systemic over-confidence in math
can balance out systemic under-confidence in history, yielding perfect
calibration in aggregate). To attain well-calibrated confidence estimates for
any slice of a distribution, we propose a new framework for few-shot
slice-specific recalibration. Specifically, we train a recalibration model that
takes in a few unlabeled examples from any given slice and predicts a curve
that remaps confidence scores to be more accurate for that slice. Our trained
model can recalibrate for arbitrary new slices, without using any labeled data
from that slice. This enables us to identify domain-specific confidence
thresholds above which the LM's predictions can be trusted, and below which it
should abstain. Experiments show that our few-shot recalibrator consistently
outperforms existing calibration methods, for instance improving calibration
error for PaLM2-Large on MMLU by 16%, as compared to temperature scaling.
comment: preprint
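The point above about aggregate calibration hiding slice-level miscalibration can be checked with a small numeric example. The sketch uses the gap between mean confidence and accuracy as a deliberately simplified stand-in for a binned calibration error; the data and the function name `calibration_gap` are invented for illustration and are not the paper's metric or results.

```python
from typing import List, Tuple

Example = Tuple[float, bool]  # (confidence, was the answer correct?)


def calibration_gap(examples: List[Example]) -> float:
    """Mean confidence minus accuracy; positive means over-confident."""
    mean_conf = sum(c for c, _ in examples) / len(examples)
    accuracy = sum(1 for _, ok in examples if ok) / len(examples)
    return mean_conf - accuracy


# Toy data: the model always reports 0.7 confidence.
math_slice = [(0.7, i < 5) for i in range(10)]     # 50% correct
history_slice = [(0.7, i < 9) for i in range(10)]  # 90% correct

gap_math = calibration_gap(math_slice)        # about +0.2 (over-confident)
gap_history = calibration_gap(history_slice)  # about -0.2 (under-confident)
gap_all = calibration_gap(math_slice + history_slice)  # about 0.0
```

The two slice-level errors cancel in aggregate, which is the situation a slice-specific recalibrator is meant to correct.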
☆ BlendX: Complex Multi-Intent Detection with Blended Patterns LREC
Task-oriented dialogue (TOD) systems are commonly designed with the
presumption that each utterance represents a single intent. However, this
assumption may not accurately reflect real-world situations, where users
frequently express multiple intents within a single utterance. While there is
an emerging interest in multi-intent detection (MID), existing in-domain
datasets such as MixATIS and MixSNIPS have limitations in their formulation. To
address these issues, we present BlendX, a suite of refined datasets featuring
more diverse patterns than their predecessors, elevating both their complexity
and diversity. For dataset construction, we utilize both rule-based heuristics
and a generative tool -- OpenAI's ChatGPT -- which is augmented with a
similarity-driven strategy for utterance selection. To ensure the quality of
the proposed datasets, we also introduce three novel metrics that assess the
statistical properties of an utterance related to word count, conjunction use,
and pronoun usage. Extensive experiments on BlendX reveal that state-of-the-art
MID models struggle with the challenges posed by the new datasets, highlighting
the need to reexamine the current state of the MID field. The dataset is
available at https://github.com/HYU-NLP/BlendX.
comment: Accepted to LREC-COLING 2024
☆ RankMamba, Benchmarking Mamba's Document Ranking Performance in the Era of Transformers
The Transformer structure has achieved great success in multiple applied machine
learning communities, such as natural language processing (NLP), computer
vision (CV) and information retrieval (IR). The Transformer architecture's core
mechanism -- attention -- requires $O(n^2)$ time complexity in training and $O(n)$
time complexity in inference. Many works have been proposed to improve the
attention mechanism's scalability, such as Flash Attention and Multi-query
Attention. A different line of work aims to design new mechanisms to replace
attention. Recently, a notable model structure -- Mamba, which is based on
state space models, has achieved transformer-equivalent performance in multiple
sequence modeling tasks.
In this work, we examine Mamba's efficacy through the lens of a classical IR
task -- document ranking. A reranker model takes a query and a document as
input, and predicts a scalar relevance score. This task demands the language
model's ability to comprehend lengthy contextual inputs and to capture the
interaction between query and document tokens. We find that (1) Mamba models
achieve competitive performance compared to transformer-based models with the
same training recipe; (2) but also have a lower training throughput in
comparison to efficient transformer implementations such as flash attention. We
hope this study can serve as a starting point to explore Mamba models in other
classical IR tasks. Our code implementation and trained checkpoints are made
public to facilitate reproducibility at
https://github.com/zhichaoxu-shufe/RankMamba.
☆ Toward Interactive Regional Understanding in Vision-Large Language Models NAACL 2024
Recent Vision-Language Pre-training (VLP) models have demonstrated
significant advancements. Nevertheless, these models heavily rely on image-text
pairs that capture only coarse and global information of an image, leading to a
limitation in their regional understanding ability. In this work, we introduce
RegionVLM, equipped with explicit regional modeling capabilities,
allowing it to understand user-indicated image regions. To achieve this, we
design a simple yet innovative architecture, requiring no modifications to the
model architecture or objective function. Additionally, we leverage a dataset
that contains a novel source of information, namely Localized Narratives, which
has been overlooked in previous VLP research. Our experiments demonstrate that
our single generalist model not only achieves an interactive dialogue system
but also exhibits superior performance on various zero-shot region
understanding tasks, without compromising its ability for global image
understanding.
comment: NAACL 2024 Main Conference
☆ MD-PK: Metaphor Detection via Prompt Learning and Knowledge Distillation
Metaphors are ubiquitous in daily life, yet detecting them poses a
significant challenge. Previous approaches often struggled with improper
application of language rules and overlooked the issue of data sparsity. To
address these challenges, we introduce knowledge distillation and prompt
learning into metaphor detection. Specifically, we devise a prompt learning
template tailored for the metaphor detection task. By masking target words and
providing relevant prompt information, we guide the model to accurately infer
the contextual meaning of these words. This approach not only mitigates the
interference from the literal meaning of target words but also ensures the
proper utilization of MIP language rules for metaphor detection. Moreover, we
employ a teacher model equipped with prior knowledge to generate meaningful
soft labels, guiding the optimization process of the student model. The
inclusion of soft labels, akin to label smoothing, helps alleviate the model's
tendency towards over-confidence and effectively addresses the challenge of
data sparsity. Experimental results demonstrate that our proposed model
achieves state-of-the-art performance across multiple datasets.
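The role of teacher soft labels described above can be sketched with a generic knowledge-distillation objective. This is an assumed, textbook-style formulation (a weighted sum of a hard-label cross-entropy and a cross-entropy against temperature-softened teacher probabilities); the weighting `alpha`, the `temperature`, and the function names are illustrative and are not taken from the MD-PK paper.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    hard_label: int,
    alpha: float = 0.5,
    temperature: float = 2.0,
) -> float:
    """alpha * CE(hard label) + (1 - alpha) * CE(teacher soft labels)."""
    # Standard cross-entropy against the gold (hard) label.
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    # Cross-entropy against the teacher's softened distribution.
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = -sum(t * math.log(s) for t, s in zip(soft_targets, student_soft))
    return alpha * hard_loss + (1 - alpha) * soft_loss


# Binary toy case: class 1 = metaphorical, class 0 = literal.
teacher = [0.0, 2.0]
loss_agree = distillation_loss([0.0, 2.0], teacher, hard_label=1)
loss_disagree = distillation_loss([2.0, 0.0], teacher, hard_label=1)
```

Because the soft targets spread probability mass over both classes, the student is penalized less sharply than with one-hot labels alone, which is the label-smoothing-like effect the abstract refers to.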
☆ Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models
Visual representation learning has been a cornerstone in computer vision,
evolving from supervised learning with human-annotated labels to aligning
image-text pairs from the Internet. Despite recent advancements in multi-modal
large language models (MLLMs), the visual representations they rely on, such as
CLIP embeddings, often lack access to external world knowledge critical for
real-world visual reasoning. In this work, we propose Visual Table, a novel
visual representation tailored for MLLMs. It provides hierarchical text
descriptions of holistic visual scenes, consisting of a scene description and
multiple object-centric descriptions that encompass categories, attributes, and
knowledge at instance level. We further develop a scalable generator for visual
table generation and train it on small-scale annotations from GPT4V. Extensive
evaluations demonstrate that, with generated visual tables as additional visual
representations, our model can consistently outperform the state-of-the-art
(SOTA) MLLMs across diverse benchmarks. When visual tables serve as standalone
visual representations, our model can closely match or even beat the SOTA MLLMs
that are built on CLIP visual embeddings. Our code is available at
https://github.com/LaVi-Lab/Visual-Table.
comment: Project page: https://github.com/LaVi-Lab/Visual-Table
☆ Since the Scientific Literature Is Multilingual, Our Models Should Be Too
English has long been assumed the lingua franca of scientific
research, and this notion is reflected in the natural language processing (NLP)
research involving scientific document representation. In this position piece,
we quantitatively show that the literature is largely multilingual and argue
that current models and benchmarks should reflect this linguistic diversity. We
provide evidence that text-based models fail to create meaningful
representations for non-English papers and highlight the negative user-facing
impacts of using English-only models indiscriminately across a multilingual
domain. We end with suggestions for the NLP community on how to improve
performance on non-English documents.
☆ Exploring the Deceptive Power of LLM-Generated Fake News: A Study of Real-World Detection Challenges
Recent advancements in Large Language Models (LLMs) have enabled the creation
of fake news, particularly in complex fields like healthcare. Studies highlight
the gap in the deceptive power of LLM-generated fake news with and without
human assistance, yet the potential of prompting techniques has not been fully
explored. Thus, this work aims to determine whether prompting strategies can
effectively narrow this gap. Current LLM-based fake news attacks require human
intervention for information gathering and often miss details and fail to
maintain context consistency. Therefore, to better understand threat tactics,
we propose a strong fake news attack method called conditional
Variational-autoencoder-Like Prompt (VLPrompt). Unlike current methods,
VLPrompt eliminates the need for additional data collection while maintaining
contextual coherence and preserving the intricacies of the original text. To
propel future research on detecting VLPrompt attacks, we created a new dataset
named VLPrompt fake news (VLPFN) containing real and fake texts. Our
experiments, including various detection methods and novel human study metrics,
were conducted to assess their performance on our dataset, yielding numerous
findings.
☆ ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus LREC
We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech
corpus. The corpus comprises twelve hours of Zoom meetings involving multiple
speakers role-playing a work situation where Students brainstorm ideas for a
certain topic and then discuss it with an Interlocutor. The meetings cover
different topics and are divided into phases with different language setups.
The corpus presents a challenging set for automatic speech recognition (ASR),
including two languages (Arabic and English) with Arabic spoken in multiple
variants (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic) and English
used with various accents. Adding to the complexity of the corpus, there is
also code-switching between these languages and dialects. As part of our work,
we take inspiration from established sets of transcription guidelines to
present a set of guidelines handling issues of conversational speech,
code-switching and orthography of both languages. We further enrich the corpus
with two layers of annotations: (1) dialectness level annotation for the
portion of the corpus where mixing occurs between different variants of Arabic,
and (2) automatic morphological annotations, including tokenization,
lemmatization, and part-of-speech tagging.
comment: Accepted to LREC-COLING 2024
☆ Mechanisms of non-factual hallucinations in language models
State-of-the-art language models (LMs) sometimes generate non-factual
hallucinations that misalign with world knowledge. Despite extensive efforts to
detect and mitigate hallucinations, understanding their internal mechanisms
remains elusive. Our study investigates the mechanistic causes of
hallucination, specifically non-factual ones where the LM incorrectly predicts
object attributes in response to subject-relation queries. With causal
mediation analysis and embedding space projection, we identify two general
mechanistic causes of hallucinations shared across LMs of various scales and
designs: 1) insufficient subject attribute knowledge in lower layer MLPs, and
2) failing to select the correct object attribute in upper layer attention
heads and MLPs. These two mechanisms exhibit varying degrees of subject-object
association, predictive uncertainty and perturbation robustness. Additionally,
we scrutinize LM pre-training checkpoints, revealing distinct learning dynamics
for the two mechanistic causes of hallucinations. We also highlight how
attribution features from our causal analysis can effectively construct
hallucination detectors. Our work proposes a mechanistic understanding of LM
factual errors.
♻ ☆ Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization
Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, Weiming Lu
Large Language Models exhibit robust problem-solving capabilities for diverse
tasks. However, most LLM-based agents are designed as specific task solvers
with sophisticated prompt engineering, rather than agents capable of learning
and evolving through interactions. These task solvers necessitate manually
crafted prompts to inform task rules and regulate LLM behaviors, leaving them
inherently unable to address complex dynamic scenarios, e.g., large interactive
games. In light of this, we propose Agent-Pro: an LLM-based Agent with
Policy-level Reflection and Optimization that can learn a wealth of expertise
from interactive experiences and progressively elevate its behavioral policy.
Specifically, it involves a dynamic belief generation and reflection process
for policy evolution. Rather than action-level reflection, Agent-Pro
iteratively reflects on past trajectories and beliefs, fine-tuning its
irrational beliefs for a better policy. Moreover, a depth-first search is
employed for policy optimization, ensuring continual enhancement in policy
payoffs. Agent-Pro is evaluated across two games: Blackjack and Texas Hold'em,
outperforming vanilla LLM and specialized models. Our results show Agent-Pro
can learn and evolve in complex and dynamic scenes, which also benefits
numerous LLM-based applications.
comment: LLM-based Agent
♻ ☆ Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives
The reflection capacity of Large Language Models (LLMs) has garnered extensive
attention. Post-hoc prompting strategies, e.g., Reflexion and Self-Refine,
refine an LLM's response based on self-evaluated or external feedback. However,
recent research indicates that, without external feedback, an LLM's intrinsic
reflection is unstable. Our investigation unveils that the key bottleneck is the
quality of the self-evaluated feedback. We find LLMs often exhibit overconfidence
or high randomness when self-evaluating, offering stubborn or inconsistent feedback,
which causes poor reflection. To remedy this, we advocate Self-Contrast: It
adaptively explores diverse solving perspectives tailored to the request,
contrasts the differences, and summarizes these discrepancies into a checklist
which could be used to re-examine and eliminate discrepancies. Our method
endows the LLM with diverse perspectives to alleviate stubborn biases. Moreover,
their discrepancies indicate potential errors or inherent uncertainties that
the LLM often overlooks. Reflecting upon these can catalyze more accurate and
stable reflection. Experiments conducted on a series of reasoning and
translation tasks with different LLMs serve to underscore the effectiveness and
generality of our strategy.
♻ ☆ NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
While recent large-scale text-to-speech (TTS) models have achieved
significant progress, they still fall short in speech quality, similarity, and
prosody. Considering speech intricately encompasses various attributes (e.g.,
content, prosody, timbre, and acoustic details) that pose significant
challenges for generation, a natural idea is to factorize speech into
individual subspaces representing different attributes and generate them
individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with
novel factorized diffusion models to generate natural speech in a zero-shot
way. Specifically, 1) we design a neural codec with factorized vector
quantization (FVQ) to disentangle speech waveform into subspaces of content,
prosody, timbre, and acoustic details; 2) we propose a factorized diffusion
model to generate attributes in each subspace following its corresponding
prompt. With this factorization design, NaturalSpeech 3 can effectively and
efficiently model intricate speech with disentangled subspaces in a
divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the
state-of-the-art TTS systems on quality, similarity, prosody, and
intelligibility, and achieves on-par quality with human recordings.
Furthermore, we achieve better performance by scaling to 1B parameters and 200K
hours of training data.
comment: Achieving human-level quality and naturalness on multi-speaker
datasets (e.g., LibriSpeech) in a zero-shot way
♻ ☆ ChatGPT Needs SPADE (Sustainability, PrivAcy, Digital divide, and Ethics) Evaluation: A Review
ChatGPT is another large language model (LLM) widely available to
consumers on their devices, but owing to its performance and ability to converse
effectively, it has gained huge popularity in both the research and
industrial communities. Recently, many studies have been published to show the
effectiveness, efficiency, integration, and sentiments of ChatGPT and other
LLMs. In contrast, this study focuses on the important aspects that are mostly
overlooked, i.e., sustainability, privacy, digital divide, and ethics, and
suggests that not only ChatGPT but every subsequent entry in the category of
conversational bots should undergo Sustainability, PrivAcy, Digital divide, and
Ethics (SPADE) evaluation. This paper discusses in detail the issues and
concerns raised over ChatGPT in line with the aforementioned characteristics. We
also discuss the recent EU AI Act briefly in accordance with the SPADE
evaluation. We support our hypothesis with some preliminary data collection and
visualizations along with hypothesized facts. We also suggest mitigations and
recommendations for each of the concerns. Furthermore, we suggest some
policies and recommendations for the EU AI policy act concerning ethics, digital
divide, and sustainability.
comment: 29 pages, 8 figures, 4 tables
♻ ☆ Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities LREC
With the rise of Large Language Models (LLMs), AI assistants' ability to
utilize tools, especially through API calls, has advanced notably. This
progress has necessitated more accurate evaluation methods. Many existing
studies adopt static evaluation, where they assess AI assistants' API calls
based on pre-defined dialogue histories. However, such an evaluation method can be
misleading, as an AI assistant might fail in generating API calls from
preceding human interaction in real cases. Instead of the resource-intensive
method of direct human-machine interactions, we propose Automated Dynamic
Evaluation (AutoDE) to assess an assistant's API call capability without human
involvement. In our framework, we endeavor to closely mirror genuine human
conversation patterns in human-machine interactions, using an LLM-based user
agent, equipped with a user script to ensure human alignment. Experimental
results highlight that AutoDE uncovers errors overlooked by static evaluations,
aligning more closely with human assessment. In tests of four AI assistants on
our crafted benchmark, our method mirrored human evaluation more closely than
conventional static evaluations.
comment: Accepted at LREC-COLING 2024
♻ ☆ Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language LREC
Relation extraction is essential for extracting and understanding
biographical information in the context of digital humanities and related
subjects. There is a growing interest in the community to build datasets
capable of training machine learning models to extract relationships. However,
annotating such datasets can be expensive and time-consuming, in addition to
being limited to English. This paper applies guided distant supervision to
create a large biographical relationship extraction dataset for German. Our
dataset, composed of more than 80,000 instances for nine relationship types, is
the largest biographical German relationship extraction dataset. We also create
a manually annotated dataset with 2000 instances to evaluate the models and
release it together with the dataset compiled using guided distant supervision.
We train several state-of-the-art machine learning models on the automatically
created dataset and release them as well. Furthermore, we conduct
multilingual and cross-lingual experiments that could benefit many low-resource
languages.
comment: Accepted to LREC-COLING 2024 (The 2024 Joint International Conference
on Computational Linguistics, Language Resources and Evaluation)
♻ ☆ GlotScript: A Resource and Tool for Low Resource Writing System Identification LREC
We present GlotScript, an open resource and tool for low resource writing
system identification. GlotScript-R is a resource that provides the attested
writing systems for more than 7,000 languages. It is compiled by aggregating
information from existing writing system resources. GlotScript-T is a writing
system identification tool that covers all 161 Unicode 15.0 scripts. For an
input text, it returns its script distribution where scripts are identified by
ISO 15924 codes. We also present two use cases for GlotScript. First, we
demonstrate that GlotScript can help clean multilingual corpora such as mC4
and OSCAR. Second, we analyze the tokenization of a number of language models
such as GPT-4 using GlotScript and provide insights on the coverage of low
resource scripts and languages by each language model. We hope that GlotScript
will become a useful resource for work on low resource languages in the NLP
community. GlotScript-R and GlotScript-T are available at
https://github.com/cisnlp/GlotScript.
comment: LREC-COLING 2024
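To make the idea of a "script distribution keyed by ISO 15924 codes" from the GlotScript entry above concrete, here is a minimal sketch. It is not GlotScript's API or implementation: the function name `script_distribution` and the small lookup table are hypothetical, and the script of each character is only approximated from the first word of its Unicode character name, which covers a handful of scripts rather than all 161.

```python
import unicodedata
from collections import Counter
from typing import Dict

# Hypothetical, deliberately tiny table: first word of the Unicode character
# name -> ISO 15924 script code. A real tool would use the Unicode Script
# property instead of this name-based approximation.
NAME_PREFIX_TO_ISO15924 = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "ARABIC": "Arab",
    "HEBREW": "Hebr",
    "DEVANAGARI": "Deva",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
    "HANGUL": "Hang",
    "CJK": "Hani",
}


def script_distribution(text: str) -> Dict[str, float]:
    """Return the share of alphabetic characters per (approximate) script."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        prefix = name.split(" ", 1)[0] if name else ""
        # "Zzzz" marks characters this toy table does not cover.
        counts[NAME_PREFIX_TO_ISO15924.get(prefix, "Zzzz")] += 1
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}
```

A distribution like this could serve as a simple corpus-cleaning signal, for example by flagging lines whose dominant script is not one attested for the labeled language.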
♻ ☆ NLPre: a revised approach towards language-centric benchmarking of Natural Language Preprocessing systems LREC
With the advancements of transformer-based architectures, we observe the rise
of natural language preprocessing (NLPre) tools capable of solving preliminary
NLP tasks (e.g. tokenisation, part-of-speech tagging, dependency parsing, or
morphological analysis) without any external linguistic guidance. It is arduous
to compare novel solutions to well-entrenched preprocessing toolkits, relying
on rule-based morphological analysers or dictionaries. Aware of the
shortcomings of existing NLPre evaluation approaches, we investigate a novel
method of reliable and fair evaluation and performance reporting. Inspired by
the GLUE benchmark, the proposed language-centric benchmarking system enables
comprehensive ongoing evaluation of multiple NLPre tools, while credibly
tracking their performance. The prototype application is configured for Polish
and integrated with the thoroughly assembled NLPre-PL benchmark. Based on this
benchmark, we conduct an extensive evaluation of a variety of Polish NLPre
systems. To facilitate the construction of benchmarking environments for other
languages, e.g. NLPre-GA for Irish or NLPre-ZH for Chinese, we ensure full
customization of the publicly released source code of the benchmarking system.
The links to all the resources (deployed platforms, source code, trained
models, datasets etc.) can be found on the project website:
https://sites.google.com/view/nlpre-benchmark.
comment: Accepted at LREC-COLING 2024
♻ ☆ Structure Guided Large Language Model for SQL Generation
Generating accurate Structured Query Language (SQL) is a long-standing
problem, especially in matching users' semantic queries with structured
databases and then generating structured SQL. Existing models typically input
queries and database schemas into the LLM and rely on the LLM to perform
semantic-structure matching and generate structured SQL. However, such
solutions overlook the structural information within user queries and
databases, which can be utilized to enhance the generation of structured SQL.
This oversight can lead to inaccurate or unexecutable SQL generation. To fully
exploit the structure, we propose a structure-to-SQL framework, which leverages
the inherent structure information to improve the SQL generation of LLMs.
Specifically, we introduce our Structure Guided SQL (SGU-SQL) generation model.
SGU-SQL first links user queries and databases in a structure-enhanced manner.
It then decomposes complicated linked structures with grammar trees to guide
the LLM to generate the SQL step by step. Extensive experiments on two
benchmark datasets illustrate that SGU-SQL can outperform sixteen SQL
generation baselines.
♻ ☆ Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism
Neural Information Retrieval (NIR) has significantly improved upon
heuristic-based IR systems. Yet, failures remain frequent, the models used
often being unable to retrieve documents relevant to the user's query. We
address this challenge by proposing a lightweight abstention mechanism tailored
for real-world constraints, with particular emphasis placed on the reranking
phase. We introduce a protocol for evaluating abstention strategies in a
black-box scenario, demonstrating their efficacy, and propose a simple yet
effective data-driven mechanism. We provide open-source code for experiment
replication and abstention implementation, fostering wider adoption and
application in diverse contexts.
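The abstract above does not spell out its abstention mechanism, so the following is only a generic sketch of the idea under stated assumptions: the reranker is treated as a black box that returns relevance scores, the confidence signal is simply the maximum score, and the threshold is calibrated on held-out queries labeled as good or bad. The names `calibrate_threshold` and `rerank_or_abstain` are hypothetical.

```python
from typing import List, Optional, Sequence


def calibrate_threshold(confidences: Sequence[float], is_good: Sequence[bool]) -> float:
    """Pick the threshold that best separates good from bad held-out queries.

    confidences[i] is the confidence signal for query i (here: max reranker
    score) and is_good[i] says whether the ranking for that query was useful.
    """
    best_t, best_correct = float("-inf"), -1
    for t in sorted(set(confidences)):
        # A query is answered when its confidence reaches the threshold.
        correct = sum((c >= t) == g for c, g in zip(confidences, is_good))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def rerank_or_abstain(scores: Sequence[float], threshold: float) -> Optional[List[int]]:
    """Return document indices sorted by score, or None to abstain."""
    if not scores or max(scores) < threshold:
        return None  # abstain: no candidate looks relevant enough
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Using only the output scores keeps the sketch compatible with a black-box setting, since no access to model internals is needed.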
♻ ☆ Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey NAACL 2024
Large Language Models (LLMs) are now commonplace in conversation
applications. However, their risks of misuse for generating harmful responses
have raised serious societal concerns and spurred recent research on LLM
conversation safety. Therefore, in this survey, we provide a comprehensive
overview of recent studies, covering three critical aspects of LLM conversation
safety: attacks, defenses, and evaluations. Our goal is to provide a structured
summary that enhances understanding of LLM conversation safety and encourages
further investigation into this important subject. For easy reference, we have
categorized all the studies mentioned in this survey according to our taxonomy,
available at: https://github.com/niconi19/LLM-conversation-safety.
comment: Accepted to NAACL 2024
♻ ☆ CARE: Co-Attention Network for Joint Entity and Relation Extraction LREC
Joint entity and relation extraction is the fundamental task of information
extraction, consisting of two subtasks: named entity recognition and relation
extraction. However, most existing joint extraction methods suffer from issues
of feature confusion or inadequate interaction between the two subtasks.
Addressing these challenges, in this work, we propose a Co-Attention network
for joint entity and Relation Extraction (CARE). Our approach includes adopting
a parallel encoding strategy to learn separate representations for each
subtask, aiming to avoid feature overlap or confusion. At the core of our
approach is the co-attention module that captures two-way interaction between
the two subtasks, allowing the model to leverage entity information for
relation prediction and vice versa, thus promoting mutual enhancement. Through
extensive experiments on three benchmark datasets for joint entity and relation
extraction (NYT, WebNLG, and SciERC), we demonstrate that our proposed model
outperforms existing baseline models. Our code will be available at
https://github.com/kwj0x7f/CARE.
comment: Accepted by LREC-COLING 2024
♻ ☆ Few-Shot Detection of Machine-Generated Text using Style Representations
The advent of instruction-tuned language models that convincingly mimic human
writing poses a significant risk of abuse. However, such abuse may be
counteracted with the ability to detect whether a piece of text was composed by
a language model rather than a human author. Some previous approaches to this
problem have relied on supervised methods by training on corpora of confirmed
human- and machine-written documents. Unfortunately, model under-specification
poses an unavoidable challenge for neural network-based detectors, making them
brittle in the face of data shifts, such as the release of newer language
models producing still more fluent text than the models used to train the
detectors. Other approaches require access to the models that may have
generated a document in question, which is often impractical. In light of these
challenges, we pursue a fundamentally different approach not relying on samples
from language models of concern at training time. Instead, we propose to
leverage representations of writing style estimated from human-authored text.
Indeed, we find that features effective at distinguishing among human authors
are also effective at distinguishing human from machine authors, including
state-of-the-art large language models like Llama-2, ChatGPT, and GPT-4.
Furthermore, given a handful of examples composed by each of several specific
language models of interest, our approach affords the ability to predict which
model generated a given document. The code and data to reproduce our
experiments are available at
https://github.com/LLNL/LUAR/tree/main/fewshot_iclr2024.
♻ ☆ A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily NAACL 2024
Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to
provide useful and safe responses. However, adversarial prompts known as
'jailbreaks' can circumvent safeguards, leading LLMs to generate potentially
harmful content. Exploring jailbreak prompts can help to better reveal the
weaknesses of LLMs and further steer us to secure them. Unfortunately, existing
jailbreak methods either suffer from intricate manual design or require
optimization on other white-box models, which compromises either generalization
or efficiency. In this paper, we generalize jailbreak prompt attacks into two
aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we
propose ReNeLLM, an automatic framework that leverages LLMs themselves to
generate effective jailbreak prompts. Extensive experiments demonstrate that
ReNeLLM significantly improves the attack success rate while greatly reducing
the time cost compared to existing baselines. Our study also reveals the
inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze
the failure of LLM defenses from the perspective of prompt execution priority,
and propose corresponding defense strategies. We hope that our research can
catalyze both the academic community and LLM developers towards the provision
of safer and more regulated LLMs. The code is available at
https://github.com/NJUNLP/ReNeLLM.
comment: Accepted by NAACL 2024, 18 pages, 7 figures, 13 tables
♻ ☆ Visually Guided Generative Text-Layout Pre-training for Document Intelligence NAACL 2024
Prior studies show that pre-training techniques can boost the performance of
visual document understanding (VDU), which typically requires models to gain
abilities to perceive and reason over both document texts and layouts (e.g.,
locations of texts and table-cells). To this end, we propose visually guided
generative text-layout pre-training, named ViTLP. Given a document image, the
model optimizes hierarchical language and layout modeling objectives to
generate the interleaved text and layout sequence. In addition, to address the
limitation of processing long documents by Transformers, we introduce a
straightforward yet effective multi-segment generative pre-training scheme,
facilitating ViTLP to process word-intensive documents of any length. ViTLP can
function as a native OCR model to localize and recognize texts of document
images. Besides, ViTLP can be effectively applied to various downstream VDU
tasks. Extensive experiments show that ViTLP achieves competitive performance
over existing baselines on benchmark VDU tasks, including information
extraction, document classification, and document question answering.
comment: Accepted to NAACL 2024 main conference. The first version of this
paper was submitted to OpenReview
(https://openreview.net/forum?id=ARtBIBAmNR) in June 2023
♻ ☆ $\textit{LinkPrompt}$: Natural and Universal Adversarial Attacks on Prompt-based Language Models NAACL2024
Prompt-based learning is a new language model training paradigm that adapts
the Pre-trained Language Models (PLMs) to downstream tasks, which revitalizes
the performance benchmarks across various natural language processing (NLP)
tasks. Instead of using a fixed prompt template to fine-tune the model, some
research demonstrates the effectiveness of searching for the prompt via
optimization. Such a prompt optimization process of prompt-based learning on PLMs
also gives insight into generating adversarial prompts to mislead the model,
raising concerns about the adversarial vulnerability of this paradigm. Recent
studies have shown that universal adversarial triggers (UATs) can be generated
to alter not only the predictions of the target PLMs but also the prediction of
corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based
learning paradigm. However, UATs found in previous works are often unreadable
tokens or characters and can be easily distinguished from natural texts with
adaptive defenses. In this work, we consider the naturalness of the UATs and
develop $\textit{LinkPrompt}$, an adversarial attack algorithm to generate UATs
by a gradient-based beam search algorithm that not only effectively attacks the
target PLMs and PFMs but also maintains the naturalness among the trigger
tokens. Extensive results demonstrate the effectiveness of
$\textit{LinkPrompt}$, as well as the transferability of UATs generated by
$\textit{LinkPrompt}$ to open-sourced Large Language Model (LLM) Llama2 and
API-accessed LLM GPT-3.5-turbo.
comment: Accepted to the main conference of NAACL2024
♻ ☆ LLatrieval: LLM-Verified Retrieval for Verifiable Generation NAACL 2024
Verifiable generation aims to let the large language model (LLM) generate
text with supporting documents, which enables the user to flexibly verify the
answer and makes the LLM's output more reliable. Retrieval plays a crucial role
in verifiable generation. Specifically, the retrieved documents not only
supplement knowledge to help the LLM generate correct answers, but also serve
as supporting evidence for the user to verify the LLM's output. However, the
widely used retrievers become the bottleneck of the entire pipeline and limit
the overall performance. Their capabilities are usually inferior to LLMs since
they often have far fewer parameters than the large language model and have
not been demonstrated to scale well to the size of LLMs. If the retriever does
not correctly find the supporting documents, the LLM cannot generate the
correct and verifiable answer, which overshadows the LLM's remarkable
abilities. To address these limitations, we propose LLatrieval (Large Language
Model Verified Retrieval), where the LLM updates the retrieval result until it
verifies that the retrieved documents can sufficiently support answering the
question. Thus, the LLM can iteratively provide feedback to retrieval and
facilitate the retrieval result to fully support verifiable generation.
Experiments show that LLatrieval significantly outperforms extensive baselines
and achieves state-of-the-art results.
comment: Accepted by NAACL 2024 (Main Conference)
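The verify-then-update loop described in the LLatrieval entry above can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation: `retrieve` and `verify` are hypothetical callables standing in for a retriever and an LLM-based verifier, and "updating the retrieval result" is reduced to widening the candidate pool each round.

```python
from typing import Callable, List


def verified_retrieval(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    verify: Callable[[str, List[str]], bool],
    max_rounds: int = 3,
    k: int = 5,
) -> List[str]:
    """Update the retrieval result until the verifier accepts it."""
    docs: List[str] = []
    for round_idx in range(max_rounds):
        # One simple update policy: fetch a larger candidate set each round.
        docs = retrieve(question, k * (round_idx + 1))
        if verify(question, docs):
            break  # the documents are judged sufficient to support an answer
    return docs


# Toy stand-ins so the sketch runs without any model or index.
CORPUS = ["doc a", "doc b", "doc c", "doc d", "doc e", "doc f",
          "Paris is the capital of France"]


def toy_retrieve(question: str, n: int) -> List[str]:
    return CORPUS[:n]


def toy_verify(question: str, docs: List[str]) -> bool:
    return any("Paris" in d for d in docs)


result = verified_retrieval("What is the capital of France?",
                            toy_retrieve, toy_verify, max_rounds=3, k=3)
```

In a real pipeline the verifier would be a prompted LLM call, and the round limit bounds the extra cost when no sufficient evidence exists.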
♻ ☆ InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling AAAI2023
Cross-lingual topic models have been prevalent for cross-lingual text
analysis by revealing aligned latent topics. However, most existing methods
suffer from repetitive topics that hinder further analysis and from
performance decline caused by low-coverage dictionaries. In this paper, we
propose the Cross-lingual Topic Modeling with Mutual Information (InfoCTM).
Instead of the direct alignment in previous work, we propose a topic alignment
with mutual information method. This works as a regularization to properly
align topics and prevent degenerate topic representations of words, which
mitigates the repetitive topic issue. To address the low-coverage dictionary
issue, we further propose a cross-lingual vocabulary linking method that finds
more linked cross-lingual words for topic alignment beyond the translations of
a given dictionary. Extensive experiments on English, Chinese, and Japanese
datasets demonstrate that our method outperforms state-of-the-art baselines,
producing more coherent, diverse, and well-aligned topics and showing better
transferability for cross-lingual classification tasks.
comment: Accepted to AAAI2023 conference. Code is available at
https://github.com/BobXWu/InfoCTM
♻ ☆ From Text to Source: Results in Detecting Large Language Model-Generated Content COLING
The widespread use of Large Language Models (LLMs), celebrated for their
ability to generate human-like text, has raised concerns about misinformation
and ethical implications. Addressing these concerns necessitates the
development of robust methods to detect and attribute text generated by LLMs.
This paper investigates "Cross-Model Detection," by evaluating whether a
classifier trained to distinguish between source LLM-generated and
human-written text can also detect text from a target LLM without further
training. The study comprehensively explores various LLM sizes and families,
and assesses the impact of conversational fine-tuning techniques, quantization,
and watermarking on classifier generalization. The research also explores Model
Attribution, encompassing source model identification, model family, and model
size classification, in addition to quantization and watermarking detection.
Our results reveal several key findings: a clear inverse relationship between
classifier effectiveness and model size, with larger LLMs being more
challenging to detect, especially when the classifier is trained on data from
smaller models. Training on data from similarly sized LLMs can improve
detection performance on larger models but may lead to decreased performance
when dealing with smaller models. Additionally, model attribution experiments
show promising results in identifying source models and model families,
highlighting detectable signatures in LLM-generated text, with particularly
remarkable outcomes in watermarking detection, while no detectable signatures
of quantization were observed. Overall, our study contributes valuable insights
into the interplay of model size, family, and training data in LLM detection
and attribution.
comment: Accepted to COLING-LREC 2024
♻ ☆ OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
To help the open-source community have a better understanding of
Mixture-of-Experts (MoE) based large language models (LLMs), we train and
release OpenMoE, a series of fully open-sourced and reproducible decoder-only
MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T
tokens. Our investigation confirms that MoE-based LLMs can offer a more
favorable cost-effectiveness trade-off than dense LLMs, highlighting the
potential effectiveness for future LLM development.
One more important contribution of this study is an in-depth analysis of the
routing mechanisms within our OpenMoE models, leading to three significant
findings: Context-Independent Specialization, Early Routing Learning, and
Drop-towards-the-End. We discovered that routing decisions in MoE models are
predominantly based on token IDs, with minimal context relevance. The
token-to-expert assignments are determined early in the pre-training phase and
remain largely unchanged. This imperfect routing can result in performance
degradation, particularly in sequential tasks like multi-turn conversations,
where tokens appearing later in a sequence are more likely to be dropped.
Finally, we rethink our design based on the above-mentioned observations and
analysis. To facilitate future MoE LLM development, we propose potential
strategies for mitigating the issues we found and further improving
off-the-shelf MoE LLM designs.
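The "Drop-towards-the-End" finding in the OpenMoE entry above is easier to picture with a toy example. The sketch below assumes a common setup that the abstract does not spell out: each expert has a fixed capacity and tokens are assigned in sequence order, so once an expert is full, later tokens routed to it are dropped. The function name `route_with_capacity` is hypothetical.

```python
from typing import Dict, List, Sequence


def route_with_capacity(expert_ids: Sequence[int], capacity: int) -> List[bool]:
    """Return one keep/drop flag per token under a per-expert capacity limit.

    expert_ids[i] is the expert chosen for token i (top-1 routing); tokens are
    processed in sequence order, so earlier tokens claim capacity first.
    """
    load: Dict[int, int] = {}
    kept: List[bool] = []
    for expert in expert_ids:
        used = load.get(expert, 0)
        if used < capacity:
            load[expert] = used + 1
            kept.append(True)
        else:
            kept.append(False)  # expert is full: this later token is dropped
    return kept


# Expert 0 is popular; with capacity 2 its third and fourth tokens are dropped.
flags = route_with_capacity([0, 0, 1, 0, 1, 0], capacity=2)
```

Because capacity is consumed front to back, drops concentrate at the end of the sequence, which is consistent with the degradation the abstract reports for multi-turn conversations.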
♻ ☆ Intrinsic Subgraph Generation for Interpretable Graph based Visual Question Answering LREC
The large success of deep learning based methods in Visual Question Answering
(VQA) has concurrently increased the demand for explainable methods. Most
methods in Explainable Artificial Intelligence (XAI) focus on generating
post-hoc explanations rather than taking an intrinsic approach, the latter
characterizing an interpretable model. In this work, we introduce an
interpretable approach for graph-based VQA and demonstrate competitive
performance on the GQA dataset. This approach bridges the gap between
interpretability and performance. Our model is designed to intrinsically
produce a subgraph during the question-answering process as its explanation,
providing insight into the decision making. To evaluate the quality of these
generated subgraphs, we compare them against established post-hoc
explainability methods for graph neural networks, and perform a human
evaluation. Moreover, we present quantitative metrics that correlate with the
evaluations of human assessors, acting as automatic metrics for the generated
explanatory subgraphs. Our implementation is available at
https://github.com/DigitalPhonetics/Intrinsic-Subgraph-Generation-for-VQA.
comment: Accepted at LREC-COLING 2024
♻ ☆ Retrieval-Augmented Generation for Large Language Models: A Survey
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang
Large Language Models (LLMs) showcase impressive capabilities but encounter
challenges like hallucination, outdated knowledge, and non-transparent,
untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has
emerged as a promising solution by incorporating knowledge from external
databases. This enhances the accuracy and credibility of the generation,
particularly for knowledge-intensive tasks, and allows for continuous knowledge
updates and integration of domain-specific information. RAG synergistically
merges LLMs' intrinsic knowledge with the vast, dynamic repositories of
external databases. This comprehensive review paper offers a detailed
examination of the progression of RAG paradigms, encompassing the Naive RAG,
the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the
tripartite foundation of RAG frameworks, which includes the retrieval, the
generation and the augmentation techniques. The paper highlights the
state-of-the-art technologies embedded in each of these critical components,
providing a profound understanding of the advancements in RAG systems.
Furthermore, this paper introduces an up-to-date evaluation framework and
benchmark. Finally, this article delineates the challenges currently faced
and points out prospective avenues for research and development.
comment: Ongoing Work
♻ ☆ ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus LREC
We introduce ÌròyìnSpeech, a new corpus influenced by the desire
to increase the amount of high quality, contemporary Yorùbá speech
data, which can be used for both Text-to-Speech (TTS) and Automatic Speech
Recognition (ASR) tasks. We curated about 23000 text sentences from news and
creative writing domains with the open license CC-BY-4.0. To encourage a
participatory approach to data creation, we provide 5000 curated sentences to
the Mozilla Common Voice platform to crowd-source the recording and validation
of Yorùbá speech data. In total, we created about 42 hours of speech
data recorded by 80 volunteers in-house, and 6 hours of validated recordings on
the Mozilla Common Voice platform. Our TTS evaluation suggests that a
high-fidelity, general domain, single-speaker Yorùbá voice is possible
with as little as 5 hours of speech. Similarly, for ASR we obtained a baseline
word error rate (WER) of 23.8.
comment: Accepted to LREC-COLING 2024
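For readers unfamiliar with the ASR metric quoted above, word error rate is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a generic illustration, not the authors' evaluation script; the function name is hypothetical, and it returns a fraction, so a reported value such as 23.8 would correspond to 0.238 if the figure is a percentage (an assumption).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("a b c d", "a x c"))
```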
♻ ☆ Centered Masking for Language-Image Pre-Training
We introduce Gaussian masking for Language-Image Pre-Training (GLIP), a novel,
straightforward, and effective technique for masking image patches during
pre-training of a vision-language model. GLIP builds on Fast Language-Image
Pre-Training (FLIP), which randomly masks image patches while training a CLIP
model. GLIP replaces random masking with centered masking, which uses a Gaussian
distribution and is inspired by the importance of image patches at the center
of the image. GLIP retains the same computational savings as FLIP, while
improving performance across a range of downstream datasets and tasks, as
demonstrated by our experimental results. We show the benefits of GLIP to be
easy to obtain, requiring no delicate tuning of the Gaussian, and also
applicable to data sets containing images without an obvious center focus.
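The idea of centered masking can be sketched as weighted sampling of the patches to keep. This is an illustration under stated assumptions, not the GLIP code: the function name, the normalized coordinate system, the default `sigma`, and the weighted-sampling scheme (random keys `u ** (1 / weight)`, keep the largest) are all choices made here for the example.

```python
import math
import random

def gaussian_keep_indices(grid, keep, sigma=0.5, rng=None):
    """Pick `keep` patch indices from a grid x grid layout, favoring the center."""
    rng = rng or random.Random()
    keyed = []
    for row in range(grid):
        for col in range(grid):
            # Patch centers in normalized coordinates, image center at (0, 0).
            y = (row + 0.5) / grid * 2 - 1
            x = (col + 0.5) / grid * 2 - 1
            weight = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            u = 1.0 - rng.random()  # uniform in (0, 1]
            # Weighted sampling without replacement: larger weights tend to
            # produce larger keys. A very small sigma can underflow the weight.
            keyed.append((u ** (1.0 / weight), row * grid + col))
    keyed.sort(reverse=True)
    return sorted(idx for _, idx in keyed[:keep])

kept = gaussian_keep_indices(grid=8, keep=16, rng=random.Random(0))
print(kept)  # indices of unmasked patches; all other patches are masked
```

The number of kept patches is fixed by `keep`, so the compute budget is the same as with uniform random masking; only the spatial distribution of the kept patches changes.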
♻ ☆ Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space EACL 2023
Prior research has investigated the impact of various linguistic features on
cross-lingual transfer performance. In this study, we investigate the manner in
which this effect can be mapped onto the representation space. While past
studies have focused on the impact on cross-lingual alignment in multilingual
language models during fine-tuning, this study examines the absolute evolution
of the respective language representation spaces produced by MLLMs. We place a
specific emphasis on the role of linguistic characteristics and investigate
their inter-correlation with the impact on representation spaces and
cross-lingual transfer performance. Additionally, this paper provides
preliminary evidence of how these findings can be leveraged to enhance transfer
to linguistically distant languages.
comment: SIGTYP Workshop 2023 (co-located with EACL 2023)
♻ ☆ X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment
Dongjae Shin, Hyunseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim
The impressive development of large language models (LLMs) is expanding into
the realm of large multimodal models (LMMs), which incorporate multiple types
of data beyond text. However, the nature of multimodal models leads to
significant expenses in the creation of training data. Furthermore,
constructing multilingual data for LMMs presents its own set of challenges due
to language diversity and complexity. Therefore, in this study, we propose two
cost-effective methods to solve this problem: (1) vocabulary expansion and
pretraining of multilingual LLM for specific languages, and (2) automatic and
elaborate construction of multimodal datasets using GPT4-V. Based on these
methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal
training dataset. Additionally, we developed a bilingual multimodal model that
exhibits excellent performance in both Korean and English, surpassing existing
approaches.
♻ ☆ Adapting Knowledge for Few-shot Table-to-Text Generation
Pretrained language models (PLMs) have made remarkable progress in
table-to-text generation tasks. However, the lack of domain-specific knowledge
makes it challenging to bridge the topological gap between tabular data and
text, especially in real-world applications with limited resources. To mitigate
the limitation of insufficient labeled data, we propose a novel framework:
Adapt-Knowledge-to-Generate (AKG). The core insight of AKG is to adapt
unlabeled domain-specific knowledge into the model, which brings at least three
benefits: (1) it injects representation of normal table-related descriptions to
bridge the topological gap between tabular data and texts; (2) it enables us to
use large amounts of unlabeled domain-specific knowledge fully, which can
alleviate the PLMs' inherent shortcomings of lacking domain knowledge; (3) it
allows us to design various tasks to employ the domain-specific knowledge.
Extensive experiments and analyses are conducted on three open-domain, few-shot
natural language generation (NLG) data sets: Humans, Songs, and Books. Compared
to previous state-of-the-art approaches, our model achieves superior
performance in terms of both fluency and accuracy as judged by human and
automatic evaluations.
comment: arXiv admin note: substantial text overlap with arXiv:2302.04415
♻ ☆ EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction
To address intricate real-world tasks, there has been a rising interest in
tool utilization in applications of large language models (LLMs). Developing
LLM-based agents usually requires LLMs to understand many tool functions
from different tool documentation. However, this documentation can be diverse,
redundant or incomplete, which immensely affects the capability of LLMs in
using tools. To solve this, we introduce EASYTOOL, a framework transforming
diverse and lengthy tool documentation into a unified and concise tool
instruction for easier tool usage. EasyTool purifies essential information from
extensive tool documentation of different sources, and elaborates a unified
interface (i.e., tool instruction) to offer standardized tool descriptions and
functionalities for LLM-based agents. Extensive experiments on multiple
different tasks demonstrate that EasyTool can significantly reduce token
consumption and improve the performance of tool utilization in real-world
scenarios. Our code will be available at
https://github.com/microsoft/JARVIS/ in the future.
♻ ☆ LLMs Are Few-Shot In-Context Low-Resource Language Learners
In-context learning (ICL) empowers large language models (LLMs) to perform
diverse tasks in underrepresented languages using only short in-context
information, offering a crucial avenue for narrowing the gap between
high-resource and low-resource languages. Nonetheless, only a handful of
works have explored ICL for low-resource languages, with most of them focusing on
relatively high-resource languages, such as French and Spanish. In this work,
we extensively study ICL and its cross-lingual variation (X-ICL) on 25
low-resource and 7 relatively higher-resource languages. Our study not only
assesses the effectiveness of ICL with LLMs in low-resource languages but also
identifies the shortcomings of in-context label alignment, and introduces a
more effective alternative: query alignment. Moreover, we provide valuable
insights into various facets of ICL for low-resource languages. Our study
concludes that few-shot in-context information is significant for enhancing the
low-resource understanding quality of LLMs: semantically relevant information
closes the language gap in the target language and aligns the semantics between
the targeted low-resource language and the high-resource language that
the model is proficient in. Our work highlights the importance of advancing ICL
research, particularly for low-resource languages.
♻ ☆ Mix-Initiative Response Generation with Dynamic Prefix Tuning NAACL 2024
Mixed initiative serves as one of the key factors in controlling conversation
directions. For a speaker, responding passively or leading proactively would
result in rather different responses. However, most dialogue systems focus on
training a holistic response generation model without any distinction among
different initiatives. This leads to the cross-contamination problem, where the
model confuses different initiatives and generates inappropriate responses.
Moreover, obtaining plenty of human annotations for initiative labels can be
expensive. To address this issue, we propose a general mix-Initiative Dynamic
Prefix Tuning framework (IDPT) to decouple different initiatives from the
generation model, which learns initiative-aware prefixes in both supervised and
unsupervised settings. Specifically, IDPT decouples initiative factors into
different prefix parameters and uses the attention mechanism to adjust the
selection of initiatives in guiding generation dynamically. The prefix
parameters can be tuned towards accurate initiative prediction as well as
mix-initiative response generation. Extensive experiments on two public
dialogue datasets show that the proposed IDPT outperforms previous baselines on
both automatic metrics and human evaluations. It also manages to generate
appropriate responses with manipulated initiatives.
comment: Accepted to the main conference of NAACL 2024
♻ ☆ PEMA: An Offsite-Tunable Plug-in External Memory Adaptation for Language Models NAACL 2024
Pre-trained language models (PLMs) show impressive performance in various
downstream NLP tasks. However, pre-training large language models demands
substantial memory and training compute. Furthermore, due to the substantial
resources required, many PLM weights are confidential. Consequently, users are
compelled to share their data with model owners for fine-tuning specific tasks.
To overcome the limitations, we introduce Plug-in External Memory Adaptation
(PEMA), a Parameter-Efficient Fine-Tuning (PEFT) method, enabling PLM
fine-tuning without requiring access to all the weights. PEMA integrates with
context representations from test data during inference to perform downstream
tasks. It uses external memory to store PLM-generated context representations
mapped with target tokens. Our method utilizes weight matrices of LoRA-like
bottlenecked adapter in the PLM's final layer to enhance efficiency. Our
approach also includes Gradual Unrolling, a novel interpolation strategy to
improve generation quality. We validate PEMA's effectiveness through
experiments on syntactic and real datasets for machine translation and style
transfer. Our findings show that PEMA outperforms other PEFT approaches in
memory and latency efficiency for training, and also excels in maintaining
sentence meaning and generating appropriate language and styles.
comment: Accepted to NAACL 2024
♻ ☆ ProSwitch: Knowledge-Guided Language Model Fine-Tuning to Generate Professional and Non-Professional Styled Text
Large Language Models (LLMs) have demonstrated efficacy in various linguistic
applications, including text summarization and controlled text generation.
However, their capacity to switch between styles via
fine-tuning remains underexplored. This study concentrates on textual
professionalism and introduces a novel methodology, named ProSwitch, which
equips a language model with the ability to produce both professional and
non-professional responses through knowledge-guided instruction tuning.
ProSwitch unfolds across three phases: data preparation for gathering domain
knowledge and training corpus; instruction tuning for optimizing language
models with multiple levels of instruction formats; and comprehensive
evaluation for assessing the professionalism discrimination and reference-based
quality of generated text. Comparative analysis of ProSwitch against both
general and specialized language models reveals that our approach outperforms
baselines in switching between professional and non-professional text
generation.
comment: 8 pages
♻ ☆ CBQ: Cross-Block Quantization for Large Language Models
Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun Zhang, Wei Li, Jie Hu, Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang
Post-training quantization (PTQ) has played a key role in compressing large
language models (LLMs) with ultra-low costs. However, existing PTQ methods only
focus on handling the outliers within one layer or one block, which ignores the
dependency of blocks and leads to severe performance degradation in low-bit
settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ
method for LLMs. CBQ employs a cross-block dependency using a homologous
reconstruction scheme, establishing long-range dependencies across multiple
blocks to minimize error accumulation. Furthermore, CBQ incorporates a
coarse-to-fine preprocessing (CFP) strategy for suppressing weight and
activation outliers, coupled with an adaptive LoRA-Rounding technique for
precise weight quantization. These innovations enable CBQ to not only handle
extreme outliers effectively but also improve overall quantization accuracy.
Extensive experiments show that CBQ achieves superior low-bit quantization
(W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across
various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model
within only 4.3 hours on a single GPU, achieving a commendable tradeoff between
performance and quantization efficiency.
♻ ☆ Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks LREC
Recent explorations with commercial Large Language Models (LLMs) have shown
that non-expert users can jailbreak LLMs by simply manipulating their prompts;
resulting in degenerate output behavior, privacy and security breaches,
offensive outputs, and violations of content regulator policies. Limited
studies have been conducted to formalize and analyze these attacks and their
mitigations. We bridge this gap by proposing a formalism and a taxonomy of
known (and possible) jailbreaks. We survey existing jailbreak methods and their
effectiveness on open-source and commercial LLMs (such as GPT-based models,
OPT, BLOOM, and FLAN-T5-XXL). We further discuss the challenges of jailbreak
detection in terms of their effectiveness against known attacks. For further
analysis, we release a dataset of model outputs across 3700 jailbreak prompts
over 4 tasks.
comment: Accepted at LREC-COLING 2024 - The 2024 Joint International
Conference on Computational Linguistics, Language Resources and Evaluation
♻ ☆ BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning AAAI 2023
Vision-Language (VL) models with the Two-Tower architecture have dominated
visual-language representation learning in recent years. Current VL models
either use lightweight uni-modal encoders and learn to extract, align and fuse
both modalities simultaneously in a deep cross-modal encoder, or feed the
last-layer uni-modal representations from the deep pre-trained uni-modal
encoders into the top cross-modal encoder. Both approaches potentially restrict
vision-language representation learning and limit model performance. In this
paper, we propose BridgeTower, which introduces multiple bridge layers that
build a connection between the top layers of uni-modal encoders and each layer
of the cross-modal encoder. This enables effective bottom-up cross-modal
alignment and fusion between visual and textual representations of different
semantic levels of pre-trained uni-modal encoders in the cross-modal encoder.
Pre-trained with only 4M images, BridgeTower achieves state-of-the-art
performance on various downstream vision-language tasks. In particular, on the
VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming
the previous state-of-the-art model METER by 1.09% with the same pre-training
data and almost negligible additional parameters and computational costs.
Notably, when further scaling the model, BridgeTower achieves an accuracy of
81.15%, surpassing models that are pre-trained on orders-of-magnitude larger
datasets. Code and checkpoints are available at
https://github.com/microsoft/BridgeTower.
comment: Accepted by AAAI 2023, Oral
♻ ☆ Dial-MAE: ConTextual Masked Auto-Encoder for Retrieval-based Dialogue Systems NAACL 2024
Dialogue response selection aims to select an appropriate response from
several candidates based on a given user and system utterance history. Most
existing works primarily focus on post-training and fine-tuning tailored for
cross-encoders. However, there are no post-training methods tailored for dense
encoders in dialogue response selection. We argue that when the current
language model, based on dense dialogue systems (such as BERT), is employed as
a dense encoder, it separately encodes dialogue context and response, leading
to a struggle to achieve the alignment of both representations. Thus, we
propose Dial-MAE (Dialogue Contextual Masking Auto-Encoder), a straightforward
yet effective post-training technique tailored for dense encoders in dialogue
response selection. Dial-MAE uses an asymmetric encoder-decoder architecture to
compress the dialogue semantics into dense vectors, which achieves better
alignment between the features of the dialogue context and response. Our
experiments have demonstrated that Dial-MAE is highly effective, achieving
state-of-the-art performance on two commonly evaluated benchmarks.
comment: This paper has been accepted by NAACL 2024
♻ ☆ SoftTiger: A Clinical Foundation Model for Healthcare Workflows
We introduce SoftTiger, a clinical large language model (CLaM) designed as a
foundation model for healthcare workflows. The narrative and unstructured
nature of clinical notes is a major obstacle for healthcare intelligentization.
We address a critical problem of structuring clinical notes into clinical data,
according to international interoperability standards. We collect and annotate
data for three subtasks, namely, international patient summary, clinical
impression and medical encounter. We then perform supervised fine-tuning of a
state-of-the-art LLM using public and credentialed clinical data. The training
is orchestrated in a way that the target model can first support basic clinical
tasks such as abbreviation expansion and temporal information extraction, and
then learn to perform more complex downstream clinical tasks. Moreover, we
address several modeling challenges in the healthcare context, e.g., an extra-long
context window. Our blind pairwise evaluation shows that SoftTiger outperforms
other popular open-source models and GPT-3.5, comparable to Gemini-pro, with a
mild gap from GPT-4. We believe that LLMs may become a stepping stone towards
healthcare digitalization and democratization. Therefore, we publicly release
SoftTiger models at scales of 13 billion and 70 billion parameters, as well as
datasets and code for our innovative scalable evaluation, hopefully, making a
significant contribution to the healthcare industry.
♻ ☆ Probing Multimodal Large Language Models for Global and Local Semantic Representations LREC
The advancement of Multimodal Large Language Models (MLLMs) has greatly
accelerated the development of applications in understanding integrated texts
and images. Recent works leverage image-caption datasets to train MLLMs,
achieving state-of-the-art performance on image-to-text tasks. However, there
are few studies exploring which layers of MLLMs contribute most to encoding the
global image information, which plays a vital role in multimodal comprehension
and generation. In this study, we find that the intermediate layers of models
can encode more global semantic information, whose representation vectors
perform better on visual-language entailment tasks, rather than the topmost
layers. We further probe models regarding local semantic representations
through object recognition tasks. We find that the topmost layers may
excessively focus on local information, leading to a diminished ability to
encode global information. Our code and data are released via
https://github.com/kobayashikanna01/probing_MLLM_rep.
comment: Accepted by LREC-COLING 2024 as a short paper (Camera Ready)
♻ ☆ Language Models are Free Boosters for Biomedical Imaging Tasks
In this study, we uncover the unexpected efficacy of residual-based large
language models (LLMs) as part of encoders for biomedical imaging tasks, a
domain traditionally devoid of language or textual data. The approach diverges
from established methodologies by utilizing a frozen transformer block,
extracted from pre-trained LLMs, as an innovative encoder layer for the direct
processing of visual tokens. This strategy represents a significant departure
from the standard multi-modal vision-language frameworks, which typically hinge
on language-driven prompts and inputs. We found that these LLMs could boost
performance across a spectrum of biomedical imaging applications, including
both 2D and 3D visual classification tasks, serving as plug-and-play boosters.
More interestingly, as a byproduct, we found that the proposed framework
achieved superior performance, setting new state-of-the-art results on
extensive, standardized datasets in MedMNIST-2D and 3D. Through this work, we
aim to open new avenues for employing LLMs in biomedical imaging and enriching
the understanding of their potential in this specialized domain.
♻ ☆ Coarse-Tuning for Ad-hoc Document Retrieval Using Pre-trained Language Models LREC
Fine-tuning in information retrieval systems using pre-trained language
models (PLM-based IR) requires learning query representations and
query-document relations, in addition to downstream task-specific learning.
This study introduces coarse-tuning as an intermediate learning stage that
bridges pre-training and fine-tuning. By learning query representations and
query-document relations in coarse-tuning, we aim to reduce the load of
fine-tuning and improve the learning effect of downstream IR tasks. We propose
Query-Document Pair Prediction (QDPP) for coarse-tuning, which predicts the
appropriateness of query-document pairs. Evaluation experiments show that the
proposed method significantly improves MRR and/or nDCG@5 in four ad-hoc
document retrieval datasets. Furthermore, the results of the query prediction
task suggested that coarse-tuning facilitated learning of query representation
and query-document relations.
comment: Accepted at LREC-COLING 2024
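The two ranking metrics named in this abstract can be computed as follows. This is a generic sketch rather than the authors' evaluation code: the function names are hypothetical, relevance labels are assumed to be non-negative numbers listed in ranked order, and the nDCG variant uses linear gain with a log2 discount (other gain functions are also in common use).

```python
import math

def mean_reciprocal_rank(rankings):
    """Average of 1/rank of the first relevant document per query (0 if none)."""
    total = 0.0
    for rels in rankings:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(rankings)

def ndcg_at_k(rels, k=5):
    """DCG of the top-k results divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum(v / math.log2(i + 1) for i, v in enumerate(values, start=1))
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

# Two queries: first relevant hit at rank 2 and at rank 1, respectively.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
print(ndcg_at_k([0, 1, 0, 0, 0], k=5))
```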
♻ ☆ Look Before You Leap: Problem Elaboration Prompting Improves Mathematical Reasoning in Large Language Models
Large language models (LLMs) still grapple with complex tasks like
mathematical reasoning. Despite significant efforts invested in improving
prefix prompts or the reasoning process, the crucial role of problem context might
have been neglected. Accurate recognition of inputs is fundamental for solving
mathematical tasks, as ill-formed problems could potentially mislead an LLM's
reasoning. In this study, we propose a new approach named Problem Elaboration
Prompting (PEP) to enhance the mathematical capacities of LLMs. Specifically,
PEP decomposes and elucidates the problem context before reasoning, thereby
enhancing the context modeling and parsing efficiency. Experiments across
datasets and models demonstrate promising performances: (1) PEP demonstrates an
overall enhancement in various mathematical tasks. For instance, with the
GPT-3.5 model, PEP exhibits improvements of 9.93% and 8.80% on GSM8k through
greedy decoding and self-consistency, respectively. (2) PEP can be easily
implemented and integrated with other prompting methods. (3) PEP shows
particular strength in handling distraction problems.
♻ ☆ Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram
In response to disinformation and propaganda from Russian online media
following the invasion of Ukraine, Russian media outlets such as Russia Today
and Sputnik News were banned throughout Europe. To maintain viewership, many of
these Russian outlets began to heavily promote their content on messaging
services like Telegram. In this work, we study how 16 Russian media outlets
interacted with and utilized 732 Telegram channels throughout 2022. Leveraging
the foundational model MPNet, DP-means clustering, and Hawkes processes, we
trace how narratives spread between news sites and Telegram channels. We show
that news outlets not only propagate existing narratives through Telegram but
that they source material from the messaging platform. For example, across the
websites in our study, between 2.3% (ura.news) and 26.7% (ukraina.ru) of
articles discussed content that originated/resulted from activity on Telegram.
Finally, tracking the spread of individual topics, we measure the rate at which
news outlets and Telegram channels disseminate content within the Russian media
ecosystem, finding that websites like ura.news and Telegram channels such as
@genshab are the most effective at disseminating their content.
comment: Accepted to ICWSM 2024
♻ ☆ NLP-based detection of systematic anomalies among the narratives of consumer complaints
We develop an NLP-based procedure for detecting systematic nonmeritorious
consumer complaints, simply called systematic anomalies, among complaint
narratives. While classification algorithms are used to detect pronounced
anomalies, in the case of smaller and frequent systematic anomalies, the
algorithms may falter due to a variety of reasons, including technical ones as
well as natural limitations of human analysts. Therefore, as the next step
after classification, we convert the complaint narratives into quantitative
data, which are then analyzed using an algorithm for detecting systematic
anomalies. We illustrate the entire procedure using complaint narratives from
the Consumer Complaint Database of the Consumer Financial Protection Bureau.